Abstract

We address the non-redundant random generation of k words of length n in 
a context-free language. Additionally, we want to avoid a predefined set of 
words. We study a rejection-based approach, whose worst-case time complexity 
is shown to grow exponentially with k for some specifications and in the limit 
case of a coupon collector. We propose two algorithms respectively based on the 
recursive method and on an unranking approach. We show how careful
implementations of these algorithms allow for a non-redundant generation of k words
of length n in $O(k \cdot n \cdot \log n)$ arithmetic operations, after a precomputation of
$O(n)$ numbers. The overall complexity is therefore dominated by the generation
of k words, and the non-redundancy comes at a negligible cost. 

Keywords: Context-free languages; Random generation; Weighted grammars; 
Non-redundant generation; Unranking; Recursive random generation 



1. Introduction 

The random generation of combinatorial objects has many direct
applications in areas ranging from software testing [?] to bioinformatics [13]. It can help
formulate conjectures on the average-case complexity of algorithms, raises
new fundamental mathematical questions, and motivates new developments on
its underlying objects. These include, but are not limited to, generatingfunctionology,
arbitrary-precision arithmetic and bijective combinatorics. Following
the recursive framework introduced by Wilf [24], very elegant and general
algorithms for uniform random generation have been designed [?] and
implemented. Many optimizations of this approach have been developed, using
specificities of certain classes of combinatorial structures [17], or floating-point
arithmetic. More recently, Boltzmann sampling [?], an algebraic approach
based on analytic combinatorics, has drawn much attention, mostly owing to
its minimal memory consumption and its intrinsic theoretical elegance.

For many applications, it is necessary to depart from uniform models.
A striking example lies in a recent paradigm for the in silico analysis of the
folding of RiboNucleic Acids (RNAs). Instead of trying to predict a
conformation of minimal free energy, current approaches tend to focus on the ensemble
properties of realizable conformations, assuming a Boltzmann probability
distribution [?] on the entire set of conformations. Random generation is then
performed, and complex structural features are evaluated in a statistical manner.
In order to capture such features, a general non-uniform scheme was introduced
by Denise et al. [7], based on the concept of weighted context-free grammars.
Recursive random generation algorithms were derived, with time and space
complexities equivalent to those observed within the uniform distribution [?].
This initial work was later extended to general decomposable classes [6]
and to a Boltzmann weighted sampling scheme, used as a preliminary step within
a rejection-based algorithm for the multidimensional sampling of languages [?].

In a weighted probability distribution, the probability ratio between the
most and least frequent words typically grows exponentially with the size of the
generated objects. Therefore a typical set of independently generated objects
may feature a large number of copies of the heaviest (i.e. most probable) objects.
This redundancy, which can be useful in some contexts, such as the estimation
of the probability of each sample from its frequency, is completely uninformative
in the context of weighted random generation, as the exact probability of any
sampled object can be derived in a straightforward manner. Consequently, it
is natural to address the non-redundant random generation of
combinatorial objects, i.e. the generation of a set of distinct objects.

Non-redundant random generation has, to the best of our knowledge,
only been addressed indirectly through the introduction of the PowerSet
construct by Zimmermann [26]. An algorithm in $O(n^2)$ arithmetic operations, or a
practical $O(n^4)$ complexity in this case, was derived for recursive decomposable
structures. The absence of redundancy in the generated set of structures was
achieved through either a rejection or an unranking algorithm.
Unfortunately, these approaches do not transpose well to the case of weighted languages.
Indeed, the rejection algorithm may have exponential time complexity
in the average case, as is shown later in the article. The unranking approach
benefits from recent contributions by Martinez and Molinero [19], who gave
generic unranking procedures for labeled combinatorial classes, generalized by
Weinberg and Nebel [23] to rule-weighted context-free grammars. However, the
latter algorithm is restricted to integral weights, and requires a transformation
of the grammar which may impact its complexity. Furthermore, the question of
computing a rank that avoids a given set of words was not addressed by these
works.

In this paper, we address the non-redundant generation of words from a
context-free language. We recall or introduce in Section 2 some concepts and
definitions related to weighted languages, and define our objective. In Section 3
we analyze the shortcomings of a naive rejection approach. We show that,
although well-suited for the uniform distribution, the rejection approach may lead
to prohibitive average-case complexities in the case of degenerate grammars,
large sets of forbidden words, large weight values, or large sets of generated
words. Then, in Section 4, we introduce the concept of immature words, which
allows us to rephrase the random generation process as a step-by-step process.
The resulting algorithm is based on the recursive method, coupled with a
custom data structure to perform a generation of k sequences of length n at the
cost of $O(k \cdot n \log n)$ arithmetic operations after a precomputation in $O(n)$
arithmetic operations. We also propose in Section 5 an unranking algorithm for
weighted grammars which, coupled with a dedicated data structure that stores
and helps avoid any forbidden word, also yields an $O(k \cdot n \log n)$ algorithm after
$O(n)$ arithmetic operations. We conclude in Section 6 with a summary of our
propositions and results, and outline some perspectives and open questions.

2. Notations and concepts 

2.1. Context-free grammars 

Let us recall, for the sake of completeness, some basic language-theoretic
definitions. A context-free grammar is a 4-tuple $\mathcal{G} = (\Sigma, \mathcal{N}, \mathcal{P}, \mathcal{S})$ where

• $\Sigma$ is the alphabet, i.e. a finite set of terminal symbols.

• $\mathcal{N}$ is a finite set of non-terminal symbols.

• $\mathcal{P}$ is the finite set of production rules, each of the form $N \to X$, for $N \in \mathcal{N}$
any non-terminal and $X \in \{\Sigma \cup \mathcal{N}\}^*$.

• $\mathcal{S}$ is the axiom of the grammar, i.e. the initial non-terminal.

A grammar $\mathcal{G}$ is then said to be in Binary Chomsky Normal Form (BCNF)
iff each of its non-terminals $N \in \mathcal{N}$ is productive and can only be derived using
a limited number of production rules (two for union-type non-terminals, and one
otherwise):

• Product type: $N \to N' . N''$ with $N', N'' \in \mathcal{N}$;

• Union type: $N \to N' \mid N''$ with $N', N'' \in \mathcal{N}$;

• Terminal type: $N \to t$ with $t \in \Sigma$;

• Epsilon type: $N \to \varepsilon$, iff $N$ cannot be derived from self-referential
non-terminals.

In the following, it will be assumed that the input grammar is given in BCNF.
This restriction does not cause any loss of generality or performance, as it can be
shown that any Chomsky Normal Form grammar can be transformed in linear
time into an equivalent BCNF grammar, having an equal number of rules up to a
constant ratio.
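Let $\mathcal{L}(N)$ be the language associated to $N \in \mathcal{N}$ within a grammar $\mathcal{G}$, i.e.
the set of words composed of terminal symbols that can be generated starting
from $N$ through a sequence of derivations. One has

$$\mathcal{L}(N) = \begin{cases}
\mathcal{L}(N') \times \mathcal{L}(N'') & \text{if } N \to N' . N'' \\
\mathcal{L}(N') \cup \mathcal{L}(N'') & \text{if } N \to N' \mid N'' \\
\{t\} & \text{if } N \to t \\
\{\varepsilon\} & \text{if } N \to \varepsilon
\end{cases} \qquad (1)$$

The language $\mathcal{L}(\mathcal{G})$ generated by a grammar $\mathcal{G} = (\Sigma, \mathcal{N}, \mathcal{P}, \mathcal{S})$ is then defined
as $\mathcal{L}(\mathcal{S})$, the language associated with the axiom $\mathcal{S}$. Finally, let us denote by $\mathcal{L}_n$
the restriction of a language $\mathcal{L}$ to words of length $n$.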
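As an illustration of Equation (1), the following minimal Python sketch enumerates the words of a given length derivable from a non-terminal. The grammar encoding (the `RULES` dictionary, the rule tags and the function name `words`) is a hypothetical choice made for this example; the grammar itself is a BCNF variant of the prefix notation of binary trees and contains no epsilon rule.

```python
from functools import lru_cache

# Hypothetical BCNF encoding: non-terminal -> (rule type, arguments).
RULES = {
    "S": ("union", "T", "B"),
    "T": ("product", "A", "U"),
    "U": ("product", "S", "S"),
    "A": ("terminal", "a"),
    "B": ("terminal", "b"),
}

@lru_cache(maxsize=None)
def words(symbol, n):
    """Words of length n in the language of `symbol`, following Equation (1)."""
    kind, *args = RULES[symbol]
    if kind == "terminal":
        return frozenset({args[0]}) if n == 1 else frozenset()
    if kind == "union":
        return words(args[0], n) | words(args[1], n)
    # Product rule: split the length n between the two factors.
    # Both factors get at least one letter since there is no epsilon rule.
    return frozenset(
        left + right
        for i in range(1, n)
        for left in words(args[0], i)
        for right in words(args[1], n - i)
    )
```

Calling `words("S", 5)` yields the two prefix codes of binary trees with two internal nodes. Explicit enumeration is exponential in general and is only meant to make the definitions concrete.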

2.2. Weighted context-free grammars 

Definition 2.1 (Weighted Grammar [7]). A weighted grammar $\mathcal{G}_\pi$ is a 5-tuple
$\mathcal{G}_\pi = (\pi, \Sigma, \mathcal{N}, \mathcal{P}, \mathcal{S})$ where $(\Sigma, \mathcal{N}, \mathcal{P}, \mathcal{S})$ define a context-free grammar and
$\pi : \Sigma \to \mathbb{R}^+$ is a weighting function that associates a real-valued weight $\pi_t$ to each
terminal symbol $t$.

This notion of weight naturally extends to any mature word $w$ in a
multiplicative fashion, i.e. such that $\pi(w) = \prod_{i=1}^{|w|} \pi_{w_i}$. It also extends additively
to any set of words $\mathcal{L}$ through $\pi(\mathcal{L}) = \sum_{w \in \mathcal{L}} \pi(w)$. One defines a $\pi$-weighted
probability distribution over $\mathcal{L}$ such that

$$P(w \mid \pi, \mathcal{L}) = \frac{\pi(w)}{\sum_{w' \in \mathcal{L}} \pi(w')} = \frac{\pi(w)}{\pi(\mathcal{L})}, \quad \forall w \in \mathcal{L}. \qquad (2)$$

The random generation of words of a given length $n$ with respect to a
weighted probability distribution has been addressed by previous works, and
an algorithm in $O(n \log n)$ after $O(n^2)$ arithmetic operations was described [?]
and implemented [20].
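The weighted distribution can be made concrete on an explicitly listed language. The sketch below is only an illustration: the helper names `weight` and `distribution` and the toy language are assumptions of this example, and exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction
from math import prod

def weight(word, pi):
    """Multiplicative weight of a mature word: product of its letter weights."""
    return prod(pi[t] for t in word)

def distribution(language, pi):
    """pi-weighted probability of each word of an explicitly listed language."""
    total = sum(weight(w, pi) for w in language)
    return {w: Fraction(weight(w, pi), total) for w in language}

# Toy example: words of length 2 in a*b*, with integer weights pi(a)=1, pi(b)=2.
PI = {"a": 1, "b": 2}
LANGUAGE = ["aa", "ab", "bb"]
```

With these weights the three words have weights 1, 2 and 4, so `distribution(LANGUAGE, PI)` assigns them probabilities 1/7, 2/7 and 4/7.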

2.3. Problem statement 

In the following, we consider algorithmic solutions for the non-redundant
generation of a collection of words of a given length, generated by an
unambiguous weighted context-free grammar. Our precise goal is to simulate efficiently
a sequence of independent calls to a random generation algorithm until a set
of exactly $k$ distinct words in a language $\mathcal{L}$ are obtained. The returned subset
$\mathcal{R} \subset \mathcal{L}_n$, $|\mathcal{R}| = k$, can be generated in any order, and the random generation
scenarios leading to an ordering $\sigma$ of $\mathcal{R}$ can be decomposed as:

$$\sigma_1 \to \sigma_1^* \to \sigma_2 \to (\sigma_1 \mid \sigma_2)^* \to \cdots \to \sigma_{k-1} \to (\sigma_1 \mid \cdots \mid \sigma_{k-1})^* \to \sigma_k.$$

The successive calls made to the weighted random generator are independent,
therefore the total probability $p_\sigma$ of getting a set $\mathcal{R}$ in a given order $\sigma$ is given
Algorithm 1 Non-redundant sequential meta-algorithm for the generation of
k distinct words of length n, from a (weighted) context-free grammar $\mathcal{G}_\pi = (\pi, \Sigma, \mathcal{N}, \mathcal{P}, \mathcal{S})$,
avoiding a forbidden set of words $\mathcal{F}$.
NonRedundantSequential($\mathcal{G}_\pi$, k, n, $\mathcal{F}$):
    Perform some precomputations...
    $\mathcal{R} \leftarrow \varnothing$
    while $|\mathcal{R}| < k$ do
        $x \leftarrow$ DrawNonRed($\mathcal{S}_n$, $\pi(\mathcal{S}_n)$, $\mathcal{G}_\pi$, $\mathcal{F}$) {Any non-redundant algorithm}
        Update some data structure...
        $(\mathcal{R}, \mathcal{F}) \leftarrow (\mathcal{R} \cup \{x\}, \mathcal{F} \cup \{x\})$
    end while
    return $\mathcal{R}$



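by

$$p_\sigma = \prod_{i=1}^{k} \frac{\pi(\sigma_i)}{\pi(\mathcal{L}_n) - \sum_{j=1}^{i-1} \pi(\sigma_j)}.$$

Summing over every possible permutation of the elements in $\mathcal{R}$, one obtains

$$P(\mathcal{R} \mid k, n) = \sum_{\sigma \in \mathfrak{S}(\mathcal{R})} \prod_{i=1}^{k} \frac{\pi(\sigma_i)}{\pi(\mathcal{L}_n) - \sum_{j=1}^{i-1} \pi(\sigma_j)} \qquad (3)$$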

where $\mathfrak{S}(\mathcal{R})$ is the set of all permutations over the elements of $\mathcal{R}$. The problem
can then be restated as:

Weighted-Non-Redundant-Generation (WNRG)

Input: An unambiguous weighted grammar $\mathcal{G}_\pi$ and two positive integers
n and k.

Output: A set of words $\mathcal{R} \subset \mathcal{L}(\mathcal{G})_n$ of cardinality k with probability
$P(\mathcal{R} \mid k, n)$.

Note that the distribution described by Equation (3) naturally arises from
a sequence of dependent calls $(r_1, \ldots, r_k)$ to weighted generators for $\mathcal{L}$,
avoiding the sets of words $\varnothing, \{r_1\}, \ldots, \{r_1, \ldots, r_{k-1}\}$ respectively, as implemented in
Algorithm 1. It is therefore sufficient to address the generation of a single word
$w$, while avoiding a prescribed set $\mathcal{F}$, in the weighted probability distribution
$P(w \mid \pi, \mathcal{L} \setminus \mathcal{F})$.
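The probability in Equation (3) can be checked numerically on a small explicit language. The following sketch is an illustration under stated assumptions: the function names and the toy weights are hypothetical, and the language is given as a dictionary mapping each word to an integer weight.

```python
from fractions import Fraction
from itertools import permutations

def ordered_probability(order, weights, total):
    """Probability of obtaining the distinct words of `order` in that order."""
    p, used = Fraction(1), 0
    for w in order:
        # Each new word is drawn among the words not yet obtained.
        p *= Fraction(weights[w], total - used)
        used += weights[w]
    return p

def subset_probability(subset, weights):
    """Equation (3): sum of the ordered probabilities over all orderings."""
    total = sum(weights.values())
    return sum(ordered_probability(o, weights, total) for o in permutations(subset))

# Toy weighted language (hypothetical): three words with weights 1, 2 and 4.
WEIGHTS = {"aa": 1, "ab": 2, "bb": 4}
```

For instance, the subset {ab, bb} is obtained with probability 2/7 · 4/5 + 4/7 · 2/3 = 64/105, and the probabilities of all subsets of size 2 sum to 1.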
Algorithm 2 Naive rejection algorithm for generating a word of length n, from
a (weighted) context-free grammar $\mathcal{G}_\pi$, avoiding a forbidden set of words $\mathcal{F}$.
NaiveRejection($\mathcal{G}_\pi$, n, $\mathcal{F}$):
    repeat
        $t \leftarrow$ draw($\mathcal{G}_\pi$, n) {One may use any available generation algorithm.}
    until $t \notin \mathcal{F}$
    return $t$



3. Naive rejection algorithm 

A naive rejection strategy for this problem consists in drawing words at
random in an unconstrained way, rejecting those from the forbidden set until
a valid word is generated, as implemented in Algorithm 2. As noted by
Zimmermann [26], this approach is suitable for the uniform distribution of objects
in general recursive specifications. This rejection strategy relies on an auxiliary
generator draw(···) of words from a (weighted) context-free language, and we
refer to previous works by Flajolet et al. [?], or Denise et al. [8] for efficient
solutions to this problem.
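A minimal sketch of the rejection loop is given below. It is not tied to any grammar: the auxiliary generator `draw` is replaced, as an assumption of this example, by a weighted choice among an explicitly listed toy language.

```python
import random

# Toy weighted language (hypothetical), standing in for draw(G_pi, n).
WORDS = ["aa", "ab", "bb"]
WEIGHTS = [1, 2, 4]

def draw(rng):
    """Unconstrained weighted draw of a single word."""
    return rng.choices(WORDS, weights=WEIGHTS, k=1)[0]

def naive_rejection(forbidden, rng):
    """Redraw until the word is not forbidden (never returns if all words are)."""
    while True:
        t = draw(rng)
        if t not in forbidden:
            return t
```

For example, `naive_rejection({"bb"}, random.Random(0))` returns either `aa` or `ab`. Since `bb` carries 4/7 of the total weight here, more than half of the draws are rejected on average.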

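Proposition 3.1 (Correctness of a naive rejection algorithm). Any word
returned by Algorithm 2 is drawn with respect to the weighted distribution on
$\mathcal{L}(\mathcal{G})_n \setminus \mathcal{F}$.

Proof. Let $w$ be the word returned by the algorithm, and $\mathcal{F} = \{f_i\}_{i=1}^{m}$. Let us
characterize the sequences of words generated by draw, leading to the
generation of $w$, by means of a rational expression over the alphabet $\mathcal{F} \cup \{w\}$:

$$\mathcal{R}_w = (f_1 \mid f_2 \mid \cdots \mid f_m)^* . w.$$

Let $p_x = \pi(x) / \pi(\mathcal{L}(\mathcal{G})_n)$ be the probability of emission of any (possibly forbidden)
word $x \in \mathcal{L}(\mathcal{G})_n$. Then the cumulated probability of the sequences of calls to
draw, leading to the generation of $w$, is such that

$$P(\mathcal{R}_w) = \frac{p_w}{1 - \sum_{i=1}^{m} p_{f_i}} = \frac{\pi(w)}{\pi(\mathcal{L}(\mathcal{G})_n) - \pi(\mathcal{F})} = \frac{\pi(w)}{\pi(\mathcal{L}(\mathcal{G})_n \setminus \mathcal{F})}.$$

□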



3.1. Complexity analysis: Uniform distribution 

Let us analyze the complexity of Algorithm 2, given $\mathcal{L}$ a context-free
language, $n \in \mathbb{N}^+$ a positive integer and $\mathcal{F} \subset \mathcal{L}_n$ a set of forbidden words, assuming
a uniform distribution on $\mathcal{L}_n$.

One first remarks that the worst-case time complexity of the algorithm is
unbounded, as nothing prevents the algorithm from repeatedly generating the
same word. An average-case analysis, however, draws a more nuanced picture
of the time complexity.

Theorem 3.2. In the uniform distribution, the naive rejection implemented in
Algorithm 2 leads to an average-case complexity in $O\left(\frac{|\mathcal{L}_n|}{|\mathcal{L}_n| - |\mathcal{F}|} \cdot k \log k \cdot \text{draw}(n)\right)$,
where draw(n) is the complexity of drawing a single word.

Proof. In the uniform model when $\mathcal{F} = \varnothing$, the number of attempts required by
the generation of the $i$-th word only depends on $i$ and is independent from prior
events. Thus the expected number of attempts $X_{n,k}$ for $k$ distinct words of size
$n$ is given by

$$E(X_{n,k}) = \sum_{i=0}^{k-1} \frac{l_n}{l_n - i} = l_n \left( H_{l_n} - H_{l_n - k} \right),$$

where $l_n := |\mathcal{L}_n|$ is the number of words of size $n$ in the language and $H_i$ the
harmonic number of order $i$, as pointed out by Flajolet et al. It follows
that $E(X_{n,k})$ is trivially increasing with $k$, while remaining upper bounded
by $k \cdot H_k \in \Theta(k \log k)$ when $k = l_n$ (Coupon collector problem). Since the
expected number of rejections due to a non-empty forbidden set $\mathcal{F}$ remains
the same throughout the generation, and does not have any influence over the
generated sequences, it can be considered independently and contributes a
factor $\frac{|\mathcal{L}_n|}{|\mathcal{L}_n| - |\mathcal{F}|}$. □

It follows that, unless the forbidden set dominates the set of words, the
per-sample complexity of the naive rejection strategy remains largely unaffected (at
most a factor $O(\log k)$, i.e. $O(n)$ since $k \in O(|\Sigma|^n)$) by the cumulated cost of
rejections.
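The coupon-collector argument can be checked numerically. The sketch below relies on the standard fact that, among $l_n$ equiprobable words of which $i$ have already been collected, a new word is obtained after $l_n / (l_n - i)$ attempts on average; the function names are hypothetical.

```python
from fractions import Fraction

def harmonic(m):
    """Harmonic number H_m, with H_0 = 0."""
    return sum(Fraction(1, j) for j in range(1, m + 1))

def expected_attempts(l_n, k):
    """Expected number of uniform draws needed to see k distinct words out of l_n."""
    return sum(Fraction(l_n, l_n - i) for i in range(k))
```

The sum telescopes into a difference of harmonic numbers, `l_n * (harmonic(l_n) - harmonic(l_n - k))`, and reduces to the classical `l_n * harmonic(l_n)` when the whole language is collected.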

3.2. Complexity analysis: Weighted languages 

Turning towards weighted context-free languages, one shows that a
rejection strategy may have an average-case complexity which is exponential in $k$,
even in the most favorable case of an empty initial set of forbidden words.

Proposition 3.3. The generation of $k$ distinct words, starting from an empty
initial forbidden set $\mathcal{F} = \varnothing$, may require a number of calls to draw that is
exponential in $k$.

Proof. Consider the following grammar, generating the language denoted by the
regular expression $a^* b^*$:

$$S \to a.S \mid T \qquad T \to b.T \mid \varepsilon$$

We adjoin a weight function $\pi$ to this grammar, such that $\pi(b) := \alpha > 1$ and
$\pi(a) := 1$. The probability of any word $\omega_m := a^{n-m} b^m$ in the language is

$$P(\omega_m) = \frac{\pi(\omega_m)}{\sum_{\omega \in \mathcal{L}(S),\ |\omega| = n} \pi(\omega)} = \frac{\alpha^m}{\sum_{i=0}^{n} \alpha^i}.$$

Now consider the set $\mathcal{V}_{n,k} \subset \mathcal{S}_n$ of words having at most $n - k$ occurrences of
the symbol $b$. The probability of generating a word from $\mathcal{V}_{n,k}$ is then

$$P(\mathcal{V}_{n,k}) = \sum_{i=0}^{n-k} P(\omega_i) = \frac{\alpha^{n-k+1} - 1}{\alpha^{n+1} - 1} < \alpha^{-k}.$$

The expected number of generations before generating any element of $\mathcal{V}_{n,k}$ is
greater than $\alpha^k$. Since any non-redundant set of $k$ sequences issued from $\mathcal{S}_n$
must contain at least one sequence from $\mathcal{V}_{n,k}$, then the average-case time
complexity of a naive rejection approach is in $\Omega(n \cdot \alpha^k)$, i.e. exponential in $k$, the
number of words. □
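The bound used in this proof can be verified with exact arithmetic. The sketch below is an illustration with a hypothetical function name: it computes the total probability of the words $a^{n-m} b^m$ having at most $n - k$ occurrences of $b$, for an integer weight $\alpha > 1$.

```python
from fractions import Fraction

def light_word_probability(alpha, n, k):
    """Total probability of the words a^(n-m) b^m with m <= n - k."""
    total = sum(alpha ** i for i in range(n + 1))      # weight of all words of length n
    light = sum(alpha ** i for i in range(n - k + 1))  # weight of the light words
    return Fraction(light, total)
```

For $\alpha = 2$ and $n = 10$, the returned value stays below $2^{-k}$ for every $k$ between 1 and 10, so the expected waiting time before drawing any such word exceeds $2^k$.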

However, the above example is based on a regular language, and may not be 
typical of the rejection algorithm's behavior on general context-free languages. 
Indeed, it can be shown that, under a natural assumption, no single word can 
asymptotically contribute a significant portion of the distribution in simple type 
grammars. 

Proposition 3.4. Let $\mathcal{G}_\pi = (\pi, \Sigma, \mathcal{N}, \mathcal{P}, \mathcal{S})$ be a weighted grammar of simple
type^1. Assume that $\omega^{\triangle}$, the most probable (i.e. largest weight w.r.t. $\pi$) word of
length $n$, has weight $\pi(\omega^{\triangle}) \in \Theta(\alpha^n)$, for some $\alpha > 0$.
Then the probability of $\omega^{\triangle}$ decreases exponentially as $n \to \infty$:

$$\exists \beta < 1 \text{ such that } P(\omega^{\triangle} \mid \pi) = \frac{\pi(\omega^{\triangle})}{\pi(\mathcal{L}(\mathcal{G}_\pi)_n)} \in O(\beta^n).$$

Proof. The Drmota-Lalley-Woods theorem [13, ?, ?] establishes that the
generating function of any simple type grammar has a square-root type singularity.
This powerful result relies on properties of the underlying system of functional
equations, and therefore also holds for the coefficients of weighted generating
functions [?]. Therefore the overall weights $W_n := \pi(\mathcal{L}(\mathcal{G}_\pi)_n)$, i.e. the
coefficients of the weighted generating function, follow an expansion of the form
$\frac{\kappa \, \alpha'^n}{n \sqrt{n}} (1 + O(1/n))$, $\alpha', \kappa > 0$. Since $\omega^{\triangle}$ is contributing to $W_n$, then one has
$\pi(\omega^{\triangle}) < \pi(\mathcal{L}(\mathcal{G}_\pi)_n)$ and therefore $\alpha < \alpha'$. The proposition follows directly from
taking any $\beta$ such that $\alpha / \alpha' < \beta < 1$, which absorbs the polynomial factor $n \sqrt{n}$. □

Furthermore, one can easily design disconnected grammars such that, for
any fixed length $n$, a subset of words $\mathcal{M} \subset \mathcal{L}(\mathcal{G}_\pi)_n$ having maximal number
of occurrences of a given symbol $t$ has total cumulated probability $1 - \alpha^n$,
$0 < \alpha < 1$, in the weighted distribution. It follows that sampling more than



1 A grammar of simple type is mainly a grammar whose dependency graph is strongly
connected and whose number of words follows an aperiodic progression (see [?] for a more
complete definition). Such a grammar can easily be found for the avatars of the algebraic
class of combinatorial structures (Dyck words, Motzkin paths, trees of fixed degree, ...), all of
which can be interpreted as trees.
Figure 1: Trees of all walks associated with prefix notations of binary trees, having length
$n \in [1,9]$ and generated by the BCNF grammar $\{S \to T \mid b,\ T \to a.U,\ U \to S.S\}$, under the
leftmost-first derivation policy $\phi$.



$|\mathcal{M}|$ words (e.g. a polynomial number of such words) can be extremely
time-consuming (typically requiring exponential time in $n$).

Finally, it is worth noticing that, in non-degenerate context-free languages,
the probability of the least probable word $\omega^{\triangledown}$ scales like $O(\alpha^n)$, $\alpha < 1$, where the
exact value of $\alpha$ depends on a subtle trade-off between structural properties
of the language and its weight function $\pi$. In particular, $\alpha$ can become
arbitrarily close to 0, by adequately increasing the weight $\pi(t)$ of some terminal
symbols. Sampling $k = |\mathcal{L}(\mathcal{G}_\pi)_n|$ words (Coupon Collector) then requires an
expected $\Omega(\alpha^{-n})$ number of calls to draw, since the waiting time of the least
probable word is clearly a lower bound for the full collection. Since the number
of words in a context-free language is bounded by $|\Sigma|^n$ and does not depend
on the weight, then the average cost per generation may grow exponentially in
$n$. This observation generalizes to many weighted languages, as shown by
Boisberranger et al. [?].



4. A step-by-step recursive algorithm 

A common approach to random generators for combinatorial objects [?]
consists of treating non-terminal symbols as independent generators. For
instance, generating from a product-type non-terminal $N \to N'.N''$ involves
two independent calls to dedicated generators for $N'$ and $N''$, either directly
(Boltzmann sampling), or after figuring out suitable lengths for $N'$ and $N''$
(Recursive method). Unfortunately, avoiding a predefined set of words breaks
the independence assumption.

For instance, consider an unweighted grammar $\mathcal{G}$, having axiom $N$, and
rules:

$$N \to N'.N'', \quad N' \to a \mid b, \quad \text{and} \quad N'' \to a \mid b.$$

Remark that, starting from either $N'$ or $N''$, both the recursive method and
Boltzmann sampling would choose one of the rules with probability 1/2. Assume
now that some set $\mathcal{F} = \{aa\}$ has to be avoided, and that a sequential choice
of derivations is adopted such that $N'$ is fully derived before taking $N''$ into
consideration. In this case, the derivation $N'' \to a$ must be forbidden iff $N' \to a$
was chosen. Moreover, the probabilities assigned to the derivations of $N'$ must
reflect the future unavailability of some choices for $N''$. One possibility is to
use altered probabilities such that $\{N' \to_{1/3} a,\ N' \to_{2/3} b\}$, and introduce
conditional probabilities such that $\{N'' \to_{0} a,\ N'' \to_{1} b\}$ when $N' \to a$, and
$\{N'' \to_{1/2} a,\ N'' \to_{1/2} b\}$ when $N' \to b$.
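The altered and conditional probabilities of this example can be recovered by brute force, as in the following sketch. The helper names are hypothetical, and the computation simply conditions the uniform distribution over the two-letter words on avoiding the forbidden word `aa`.

```python
from fractions import Fraction
from itertools import product

FORBIDDEN = {"aa"}
# All two-letter words over {a, b}, minus the forbidden ones: ab, ba, bb.
ALLOWED = [x + y for x, y in product("ab", repeat=2) if x + y not in FORBIDDEN]

def first_letter_probability(c):
    """Probability that N' derives c in the uniform distribution on ALLOWED."""
    return Fraction(sum(w[0] == c for w in ALLOWED), len(ALLOWED))

def second_letter_probability(d, c):
    """Probability that N'' derives d, given that N' derived c."""
    compatible = [w for w in ALLOWED if w[0] == c]
    return Fraction(sum(w[1] == d for w in compatible), len(compatible))
```

Multiplying the two factors gives probability 1/3 for each of `ab`, `ba` and `bb`, i.e. the uniform distribution on the non-forbidden words.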

The idea behind our step-by-step algorithm is to capture this dependency
sequentially, by considering random generation scenarios as random (parse)
walks. This perspective makes it possible to determine the total contribution of all
forbidden (i.e. previously encountered) words for each of the locally accessible
alternatives. These contributions can then be used to modify conditionally the
precomputed probabilities, leading to a uniform (resp. weighted) generation
within $\mathcal{L}(\mathcal{G})_n \setminus \mathcal{F}$, while keeping the computational cost to a reasonable level.

4.1. Immature words: A compact description of fixed-length sublanguages

Let us introduce the notion of immature words, defined as words on both
the terminal and non-terminal alphabets, where prescribed lengths are
additionally attached to any occurrence of a symbol. Formally, let $\mathcal{G} = (\Sigma, \mathcal{N}, \mathcal{P}, \mathcal{S})$
be a context-free grammar, then an immature word is any word

$$\omega \in \mathcal{L}^{\triangleleft}(\mathcal{G}) \subset ((\Sigma \cup \mathcal{N}) \times \mathbb{N}^+)^*,$$

where $\mathcal{L}^{\triangleleft}(\mathcal{G})$ is the set of immature words generated from the axiom $\mathcal{S}$. Such
words may contain non-terminal symbols, and potentially require some further
derivations before becoming a word on the terminal alphabet, or mature word.
Intuitively, immature words correspond to intermediate states in a random
generation scenario.

The language associated with an immature word $\omega$ is derived from the
languages of its symbols through

$$\mathcal{L}(\omega) = \prod_{i \in [1, |\omega|]} \mathcal{L}(\omega_i)$$

where $\mathcal{L}(s)$ is defined as in Equation (1) with $s \in \mathcal{N}$, and naturally extended to
terminal symbols $t \in \Sigma$ through $\mathcal{L}(t) = \{t\}$. In the following, we use the notation
$\pi(\omega)$ as a natural shorthand for $\pi(\mathcal{L}(\omega))$ and denote by $\pi_{\mathcal{F}}(\omega) := \pi(\mathcal{L}(\omega) \cap \mathcal{F})$
the total weight of all forbidden words in $\mathcal{L}(\omega)$.
[Figure 2, diagram not reproduced. The immature word $aba\,U_6$ admits three derivations,
$aba\,S_5 S_1$, $aba\,S_3 S_3$ and $aba\,S_1 S_5$. Each derivation $aba\,S_i S_{6-i}$ is chosen with probability
$\frac{\pi(aba S_i S_{6-i}) - \pi_{\mathcal{F}}(aba S_i S_{6-i})}{\pi(aba U_6) - \pi_{\mathcal{F}}(aba U_6)} \propto \pi(\mathcal{L}(aba S_i S_{6-i}) \setminus \mathcal{F})$.
The example mature words abaaabbbb, abaabbabb and ababababb, reached through these
respective derivations, each have probability $\frac{\pi(x)}{\pi(aba U_6) - \pi_{\mathcal{F}}(aba U_6)}$.]

Figure 2: Snapshot of a step-by-step random scenario for the language consisting of prefix
notations of binary trees of length 6, generated while avoiding $\mathcal{F}$. The step-by-step algorithm
chooses one out of three possible derivations for $aba\,U_6$ using probabilities proportional to the
overall weights of accessible/admissible words.



4.2. Random generation as a random walk in language space

An atomic derivation, starting from a word $\omega = \omega'.N.\omega'' \in \{\Sigma \cup \mathcal{N}\}^*$, is
the application of a production $N \to X$ to $\omega$, that replaces $N$ by the right-hand
side $X$ of the production, yielding $\omega'.X.\omega''$. Let us call derivation policy
a deterministic strategy that points, in an immature word, to some non-terminal
to be rewritten through an atomic derivation. Formally, a derivation policy is
a function $\phi : \mathcal{L}(\mathcal{G}) \cup \mathcal{L}^{\triangleleft}(\mathcal{G}) \to \mathbb{N} \cup \{\varnothing\}$ such that

$$\phi : \begin{cases} \omega \in \mathcal{L}(\mathcal{G}) \mapsto \varnothing \\ \omega' \in \mathcal{L}^{\triangleleft}(\mathcal{G}) \mapsto i \in [1, |\omega'|] \text{ such that } \omega'_i \in \mathcal{N}. \end{cases}$$

The unambiguity of a grammar requires that any generated word be
generated by a unique sequence of derivations. A sequence of atomic derivations
is then said to be consistent with a given derivation policy if the
non-terminal rewritten at each step is the one pointed to by the policy. This notion
provides a convenient framework for defining the unambiguity of a grammar
without explicit reference to parse trees.

Definition 4.1 (Unambiguity). Let $\mathcal{G} = (\Sigma, \mathcal{N}, \mathcal{P}, \mathcal{S})$ be a context-free
grammar and $\phi$ a derivation policy acting on $\mathcal{G}$. The grammar $\mathcal{G}$ is said to be
unambiguous if and only if, for each $\omega \in \Sigma^*$, there exists at most one sequence
of atomic derivations that is consistent with $\phi$ and produces $\omega$ from $\mathcal{S}$.

Any derivation leading to a mature word $\omega \in \mathcal{L}(\mathcal{G})$ in an unambiguous
grammar $\mathcal{G}$ can then be associated, in a one-to-one fashion, with a walk in the
space of sublanguages associated with immature words, or parse walk, taking
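Algorithm 3 Step-by-step random generation algorithm. $\mathcal{G}_\pi$ is a weighted
grammar, $\omega$ an immature word, $\mu = \pi(\omega)$ is the precomputed weight of the
language generated from $\omega$, and $\mathcal{F} \subset \mathcal{L}(\omega)$ is a set of forbidden words.
StepByStep($\omega$, $\mu$, $\mathcal{G}_\pi$, $\mathcal{F}$, $\phi$):
    if $\mu \le \pi_{\mathcal{F}}(\omega)$ then
        return Error
    else if $\phi(\omega) = \varnothing$ then
        return $\omega$ {$\omega$ is a mature word, generation is over}
    end if
    $(\omega', N_m, \omega'') \leftarrow (\omega_{[1, \phi(\omega)-1]}, \omega_{\phi(\omega)}, \omega_{[\phi(\omega)+1, |\omega|]})$
    $r \leftarrow$ rand($\mu - \pi_{\mathcal{F}}(\omega)$) {$r$ is random, uniformly in $[0, \pi(\mathcal{L}(\omega) \setminus \mathcal{F}))$}
    if $N \to N' \mid N''$ then {Union type}
        $\mu' \leftarrow \mu \cdot \pi(N'_m) / \pi(N_m)$
        $r \leftarrow r - (\mu' - \pi_{\mathcal{F}}(\omega'.N'_m.\omega''))$
        if $r < 0$ then
            return StepByStep($\omega'.N'_m.\omega''$, $\mu'$, $\mathcal{G}_\pi$, $\mathcal{F}$, $\phi$)
        else
            return StepByStep($\omega'.N''_m.\omega''$, $\mu \cdot \pi(N''_m) / \pi(N_m)$, $\mathcal{G}_\pi$, $\mathcal{F}$, $\phi$)
        end if
    else if $N \to N' . N''$ then {Product type}
        for all $i \in [1, m-1]$ do {Boustrophedon order $1, m-1, 2, m-2, \ldots$}
            $\mu_i \leftarrow \mu \cdot \pi(N'_i) \cdot \pi(N''_{m-i}) / \pi(N_m)$
            $r \leftarrow r - (\mu_i - \pi_{\mathcal{F}}(\omega'.N'_i.N''_{m-i}.\omega''))$
            if $r < 0$ then
                return StepByStep($\omega'.N'_i.N''_{m-i}.\omega''$, $\mu_i$, $\mathcal{G}_\pi$, $\mathcal{F}$, $\phi$)
            end if
        end for
    else if $N \to t$ then {Terminal type}
        return StepByStep($\omega'.t.\omega''$, $\mu$, $\mathcal{G}_\pi$, $\mathcal{F}$, $\phi$)
    end if

Where: rand($x$): Draws a random number uniformly in $[0, x)$
       $\pi_{\mathcal{F}}(\omega) := \pi(\mathcal{L}(\omega) \cap \mathcal{F})$: Total weight of forbidden words in $\mathcal{L}(\omega)$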



steps consistent with a given derivation policy $\phi$. More precisely, such a walk
starts from the axiom $\mathcal{S}$ and, for any intermediate immature word $X \in \mathcal{L}^{\triangleleft}(\mathcal{G})$,
the derivation policy $\phi$ points at a position $\phi(X)$, where a non-terminal $X_k$ can
be found. The parse walk can then be extended using one of the derivations
acting on $X_k$ (see Figures 1 and 2), until a mature word in $\Sigma^*$ is reached.

4.3. A step-by-step algorithm

Let us now describe and validate Algorithm 3, based on the recursive method
introduced by Wilf [24], which uses the concept of immature words to linearize
the generation of words. More specifically, the algorithm draws a random word
through a sequence of local choices (atomic derivations) using probabilities that
are proportional to the cumulated weight of accessible non-forbidden words, as
illustrated by Figure 2. To grant access to such weights in reasonable time, the
cumulated weights of languages generated by non-terminals are precomputed
recursively [?], and a dedicated tree-like data structure is introduced to gain
efficient access to the contribution of forbidden words.
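The following Python sketch illustrates the step-by-step principle on a small example. It rests on several assumptions that are not part of the original description: the grammar (prefix codes of unary-binary trees), its encoding and the integer weights are hypothetical; the derivation policy is leftmost-first; product splits are scanned in increasing rather than boustrophedon order; the weights $\pi(N_m)$ are computed by a naive memoized recursion; and a plain dictionary indexed by immature words stands in for the dedicated tree-like data structure. Only previously generated words are forbidden.

```python
import random
from functools import lru_cache

# Hypothetical BCNF grammar: prefix codes of unary-binary trees.
RULES = {
    "S": ("union", "X", "B"), "X": ("union", "T", "V"),
    "T": ("product", "A", "U"), "U": ("product", "S", "S"),
    "V": ("product", "C", "S"),
    "A": ("terminal", "a"), "B": ("terminal", "b"), "C": ("terminal", "c"),
}
WEIGHT = {"a": 1, "b": 1, "c": 2}  # integer weights keep the arithmetic exact

@lru_cache(maxsize=None)
def pi(sym, m):
    """Total weight of the words of length m derivable from non-terminal sym."""
    kind, *args = RULES[sym]
    if kind == "terminal":
        return WEIGHT[args[0]] if m == 1 else 0
    if kind == "union":
        return pi(args[0], m) + pi(args[1], m)
    return sum(pi(args[0], i) * pi(args[1], m - i) for i in range(1, m))

def children(word, mu):
    """Immature words (with their weights) reachable by one atomic derivation."""
    pos = next(i for i, s in enumerate(word) if isinstance(s, tuple))
    (sym, m), left, right = word[pos], word[:pos], word[pos + 1:]
    kind, *args = RULES[sym]
    if kind == "terminal":
        return [(left + (args[0],) + right, mu)]
    if kind == "union":
        return [(left + ((c, m),) + right, mu * pi(c, m) // pi(sym, m)) for c in args]
    return [(left + ((args[0], i), (args[1], m - i)) + right,
             mu * pi(args[0], i) * pi(args[1], m - i) // pi(sym, m))
            for i in range(1, m)]

def step_by_step(n, forbidden, rng):
    """Draw a word of length n, avoiding the words recorded in `forbidden`."""
    word, mu = (("S", n),), pi("S", n)
    if mu <= forbidden.get(word, 0):
        return None  # every word of length n has already been generated
    walk = [word]
    while any(isinstance(s, tuple) for s in word):
        r = rng.randrange(mu - forbidden.get(word, 0))
        for cand, mu_c in children(word, mu):
            r -= mu_c - forbidden.get(cand, 0)  # weight of admissible words
            if r < 0:
                word, mu = cand, mu_c
                break
        walk.append(word)
    for visited in walk:  # record the weight of the new word along its walk
        forbidden[visited] = forbidden.get(visited, 0) + mu
    return "".join(word)

def non_redundant(n, k, seed=0):
    """Draw up to k distinct words of length n."""
    rng, forbidden, out = random.Random(seed), {}, []
    while len(out) < k:
        w = step_by_step(n, forbidden, rng)
        if w is None:
            break
        out.append(w)
    return out
```

In this sketch, a call such as `non_redundant(5, 3)` returns three distinct words of length 5, and asking for more words than the language contains stops once every word has been produced. The dictionary lookup relies on the fact that, in an unambiguous grammar with a deterministic policy, a forbidden word belongs to the language of an immature word exactly when its parse walk visits that immature word.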

Theorem 4.2. Algorithm 3 generates $k$ distinct words of length $n$ from a
weighted grammar $\mathcal{G}_\pi$ in $O(n \cdot |\mathcal{N}| + k \cdot n \log n)$ arithmetic operations, while
storing $O(n \cdot |\mathcal{N}| + k)$ numbers, and a data structure consisting of $O(n \cdot k)$
nodes.

Proof. As discussed in Section 4.5, Algorithm 3 generates a word in $O(n \log n)$
arithmetic operations, assuming that some correcting terms $\pi_{\mathcal{F}}(\omega)$ are available
at runtime. In Section 4.5.2, a data structure is introduced that returns this
value in $O(\log n)$ time. Namely, $O(n \log n)$ times, a search in
$O(\log n)$ is performed, followed by an arithmetic operation involving large (at
least polynomial in $n$, usually exponential) numbers. It follows that the cost
of accessing the data structure is dominated by the cost of the subsequent
arithmetic operations, and the overall cost of generating $k$ words is in $O(k \cdot n \log n)$
arithmetic operations. After each generation, the data structure is updated in
$\Theta(n)$ arithmetic operations, and the complexity is therefore dominated by the
cost of the generation.

The precomputation required by the StepByStep algorithm involves
$\Theta(n \cdot |\mathcal{N}|)$ arithmetic operations, and the storage of $\Theta(n \cdot |\mathcal{N}|)$ numbers. The data
structure for $\pi_{\mathcal{F}}(\omega)$ has $\Theta(n \cdot k)$ nodes and contains $O(k)$ different numbers,
hence the overall complexity. □

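4.4. Correctness

Proposition 4.3. Assuming that $\mu = \pi(\omega)$, Algorithm 3 draws a word at
random according to the $\pi$-weighted distribution on $\mathcal{L}(\omega) \setminus \mathcal{F}$, or returns Error iff
$\mathcal{L}(\omega) \setminus \mathcal{F} = \varnothing$.

Proof. Let us start with some observations to simplify the proof. First, since
$\mu = \pi(\omega)$, then the variables $\mu'$ and $\mu_i$ of Algorithm 3 respectively obey

$$\mu' = \pi(\omega) \cdot \frac{\pi(N'_m)}{\pi(N_m)} = \frac{\pi(\omega') \cdot \pi(N_m) \cdot \pi(\omega'') \cdot \pi(N'_m)}{\pi(N_m)} = \pi(\omega'.N'_m.\omega'') \qquad (5)$$

$$\mu_i = \pi(\omega) \cdot \frac{\pi(N'_i) \cdot \pi(N''_{m-i})}{\pi(N_m)} = \pi(\omega'.N'_i.N''_{m-i}.\omega''). \qquad (6)$$

Secondly, for any immature word $\omega$, one has

$$\pi(\omega) - \pi_{\mathcal{F}}(\omega) = \pi(\mathcal{L}(\omega)) - \pi(\mathcal{L}(\omega) \cap \mathcal{F}) = \pi(\mathcal{L}(\omega) \setminus \mathcal{F}).$$

We now show that, provided that $\mu = \pi(\omega)$ holds, any word in $\mathcal{L}(\omega) \setminus \mathcal{F}$
is generated with respect to the weighted distribution on $\mathcal{L}(\omega) \setminus \mathcal{F}$. Let $d$ be the
maximum number of recursive calls needed for the generation of a mature word
from a given immature word $\omega$, then one has: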
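Base: The $d = 0$ case corresponds to an already mature word $\omega$, for which the
associated language is limited to $\{\omega\}$. In this case, $\omega$ has probability 1 in the
weighted distribution, and is indeed always generated.

Inductive step: Assuming that the theorem holds for $d \le n$, we investigate
the probabilities of emission of words that require $d = n + 1$ derivations. Let
$N_m$ be the non-terminal pointed to by $\phi$, then:

• $N \to N' \mid N''$: Let us first assume that the derivation $N_m \Rightarrow N'_m$ is chosen,
which happens with probability

$$\frac{\mu' - \pi_{\mathcal{F}}(\omega'.N'_m.\omega'')}{\mu - \pi_{\mathcal{F}}(\omega)} = \frac{\pi(\mathcal{L}(\omega'.N'_m.\omega'') \setminus \mathcal{F})}{\pi(\mathcal{L}(\omega) \setminus \mathcal{F})}.$$

The recursive call to StepByStep($\omega'.N'_m.\omega''$, $\mu'$, $\mathcal{G}_\pi$, $\mathcal{F}$, $\phi$) indeed satisfies
$\mu' = \pi(\omega'.N'_m.\omega'')$, and subsequently generates a mature word $x$ using at most $n$
recursive calls. The induction hypothesis holds, and the emission probability
of $x \in \mathcal{L}(\omega'.N'_m.\omega'') \setminus \mathcal{F}$ is therefore given by $\pi(x) / \pi(\mathcal{L}(\omega'.N'_m.\omega'') \setminus \mathcal{F})$.
The overall probability of issuing $x$ starting from $\omega$ is then

$$\frac{\pi(\mathcal{L}(\omega'.N'_m.\omega'') \setminus \mathcal{F})}{\pi(\mathcal{L}(\omega) \setminus \mathcal{F})} \cdot \frac{\pi(x)}{\pi(\mathcal{L}(\omega'.N'_m.\omega'') \setminus \mathcal{F})} = \frac{\pi(x)}{\pi(\mathcal{L}(\omega) \setminus \mathcal{F})}$$

in which one recognizes the weighted distribution on $\mathcal{L}(\omega) \setminus \mathcal{F}$, and the
argument applies symmetrically to $N''_m$.

• $N \to N' . N''$: A partition $N_m \Rightarrow N'_i . N''_{m-i}$, $i \in [1, m-1]$, is chosen with
probability

$$\frac{\mu_i - \pi_{\mathcal{F}}(\omega'.N'_i.N''_{m-i}.\omega'')}{\mu - \pi_{\mathcal{F}}(\omega)} = \frac{\pi(\mathcal{L}(\omega'.N'_i.N''_{m-i}.\omega'') \setminus \mathcal{F})}{\pi(\mathcal{L}(\omega) \setminus \mathcal{F})}.$$

A recursive call is then made on the immature word $\omega'.N'_i.N''_{m-i}.\omega''$, using
weight $\mu_i$. As established in Equation (6), one has $\mu_i = \pi(\omega'.N'_i.N''_{m-i}.\omega'')$,
therefore the induction hypothesis applies, and any word $x \in \mathcal{L}(\omega'.N'_i.N''_{m-i}.\omega'') \setminus \mathcal{F}$
is generated by the recursive call with probability

$$\frac{\pi(x)}{\pi(\mathcal{L}(\omega'.N'_i.N''_{m-i}.\omega'') \setminus \mathcal{F})}.$$

The emission probability of $x \in \mathcal{L}(\omega'.N'_i.N''_{m-i}.\omega'') \setminus \mathcal{F}$ from $\omega$ is then given
by

$$\frac{\pi(\mathcal{L}(\omega'.N'_i.N''_{m-i}.\omega'') \setminus \mathcal{F})}{\pi(\mathcal{L}(\omega) \setminus \mathcal{F})} \cdot \frac{\pi(x)}{\pi(\mathcal{L}(\omega'.N'_i.N''_{m-i}.\omega'') \setminus \mathcal{F})} = \frac{\pi(x)}{\pi(\mathcal{L}(\omega) \setminus \mathcal{F})}.$$

• $N \to t$: The emission probability for any word $x$ emitted from $\omega$ equals
that of the word issued from $\omega'.t.\omega''$. It is then given by
$\frac{\pi(x)}{\pi(\mathcal{L}(\omega'.t.\omega'') \setminus \mathcal{F})} = \frac{\pi(x)}{\pi(\mathcal{L}(\omega) \setminus \mathcal{F})}$
according to the induction hypothesis, which applies since $\pi(\omega'.t.\omega'') = \pi(\omega'.N.\omega'')$.

□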

4.5. Complexities and data structures

The overall complexity of Algorithm 3 depends critically on efficient
algorithms and data structures for:

1. Accessing the weights of languages associated with non-terminals.

2. Computing the total weight $\pi_{\mathcal{F}}(\omega) := \pi(\mathcal{L}(\omega) \cap \mathcal{F})$ of all forbidden words
accessible from an immature word $\omega$.

3. Investigating the partitions $N_m \Rightarrow N'_i . N''_{m-i}$ for product rules.

4. Handling large numbers.

4.5.1. Weights of non-terminals

As is usual within the recursive approach, the total weights $\pi(N_i)$ of
languages generated from each non-terminal $N$ must be readily available
at generation time. A precomputation of these numbers can
be performed in $\mathcal{O}(n)$ arithmetic operations, thanks to the algebraic, therefore
holonomic, nature of the weighted counting generating functions. Indeed,
the coefficients of a holonomic generating function obey a linear recurrence
with polynomial coefficients in $n$. Such a recurrence can be algorithmically
determined from the system of functional equations induced by the context-free
grammar (e.g. using the Maple package GFun [21]).
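
As a hedged illustration of the gain offered by such a recurrence, the sketch below uses a toy unweighted example that is not taken from this paper: Motzkin numbers $M_n$, which satisfy both a quadratic convolution (the direct reading of an algebraic equation) and the classical linear recurrence $(n+2)M_n = (2n+1)M_{n-1} + 3(n-1)M_{n-2}$. The function names are illustrative assumptions.

```python
def motzkin_convolution(n):
    """M_0..M_n via M_j = M_{j-1} + sum_k M_k * M_{j-2-k}: Theta(n^2) operations."""
    m = [1] * (n + 1)  # M_0 = M_1 = 1
    for j in range(2, n + 1):
        m[j] = m[j - 1] + sum(m[k] * m[j - 2 - k] for k in range(j - 1))
    return m


def motzkin_recurrence(n):
    """Same numbers via the holonomic recurrence: O(n) arithmetic operations."""
    m = [1] * (n + 1)  # M_0 = M_1 = 1
    for j in range(2, n + 1):
        # (j + 2) * M_j = (2j + 1) * M_{j-1} + 3 (j - 1) * M_{j-2}; the division is exact
        m[j] = ((2 * j + 1) * m[j - 1] + 3 * (j - 1) * m[j - 2]) // (j + 2)
    return m
```

Both functions return the same table; only the second one uses a number of arithmetic operations linear in $n$, which is the behaviour assumed for the precomputation above.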

4.5.2. A data structure for forbidden words

Proposition 4.4. The total weight $\pi_{\mathcal{F}}(\omega)$ of all forbidden words generated from
an immature word $\omega$ can be accessed by Algorithm 3 in $\mathcal{O}(\log(n))$ time, at the
cost of an update operation in $\mathcal{O}(n)$ arithmetic operations, while storing $\Theta(|\mathcal{F}|)$
additional numbers.

Proof. Let us first remark that, in any BCNF grammar, any parse walk $p_n$ that
produces a mature word of length $n$ involves $\Theta(n)$ derivations (i.e. has length
in $\Theta(n)$). To that purpose, let us discuss the number of occurrences of each
type of rule in $p_n$, by reasoning on the associated parse tree. First, let us
observe that each letter in the mature word can be bijectively associated with
the application of a terminal rule, thus $p_n$ contains exactly $n$ applications of
terminal rules. Then, product rules induce a binary structure in the parse tree,
whose leaves correspond to the $n$ terminal letters. Therefore, $p_n$ contains exactly
$n-1$ applications of product rules. Finally, sequences of union-type rules can be
found before any occurrence of a product or terminal rule. However, it should
be noted that the unambiguity of the grammar forbids derivations of the form
$N \Rightarrow^{*} N$. The length of any sequence of union-type derivations therefore cannot
exceed $|\mathcal{N}| + 1$. Treating $|\mathcal{N}|$ as a constant, the total number of occurrences of
union rules is then in $\mathcal{O}(n)$, and we conclude that the total number of derivations
involved in $p_n$ is indeed $\Theta(n)$.

Figure 3: Illustration of the update operation for the weighted tree for forbidden walks, for our
running example. Initial tree (Left): Each node is associated with an immature word $\omega$ and
its overall weight of forbidden words $\pi_{\mathcal{F}}(\omega)$ (some unary nodes are contracted for the sake of
readability). During the execution of Algorithm 3, the tree is traversed to grant efficient access
to $\pi_{\mathcal{F}}(\omega)$. Updated tree (Right): After generating a new (mature) word $\omega_d$ := abaaabbbb,
the proper suffix of the parse walk is added to the tree (Blue nodes), associated with the
additional weight $\pi_{\mathcal{F}}(\omega_d)$, which must then be propagated back to the root (bold branch), using
at most $\mathcal{O}(n)$ arithmetic operations.

Assume now that the parse walks of the elements of $\mathcal{F}$ are available as a
set $\mathcal{T}$ of sequences of immature words. We introduce a data structure, the
weighted tree of forbidden walks, a decorated prefix-tree whose nodes are
in bijection with the set of immature words in $\mathcal{T}$, and such that the overall
weight $\pi_{\mathcal{F}}(\omega) := \pi(\mathcal{L}(\omega) \cap \mathcal{F})$ is attached to each node labeled $\omega$.

The idea is to descend into the tree during the execution of Algorithm 3,
simply fetching the precomputed contributions $\pi_{\mathcal{F}}(\omega)$ of forbidden words that
are attached to local nodes. Implementation-wise, an argument $g$ is added to
Algorithm 3 (omitted in the pseudocode for the sake of readability), holding the
node associated with $\omega$ if any, or $\emptyset$ otherwise. One then gets access in $\mathcal{O}(1)$
operations to the forbidden weight $\pi_{\mathcal{F}}(\omega)$ of $\omega$, and in $\mathcal{O}(\log(n))$ to that of its
children nodes $\pi_{\mathcal{F}}(\omega'.N'_m.\omega'')$, $\pi_{\mathcal{F}}(\omega'.N''_m.\omega'')$, or $\pi_{\mathcal{F}}(\omega'.N'_i.N''_{m-i}.\omega'')$, e.g.
using AVL trees [1] to store the children of a node. Once an atomic derivation
$\omega \Rightarrow \omega'$ is chosen at random, the suitable child $g'$ of $g$ (or $\emptyset$, if no word from $\mathcal{F}$
can be derived from $\omega'$) is fed to the recursive call.

A tree update operation must then be performed, as illustrated by Figure 3:

• First, a top-down stage descends into the tree, ensuring efficient access to
$\pi_{\mathcal{F}}(\omega)$, until a new mature word $\omega_d$ is generated. Absent nodes are then
added, corresponding to the proper suffix of the parse walk (Blue nodes).
At each step, one needs to test the presence/absence of a given immature
word within the children of the current node. Since the degree of a node
is bounded by $\Theta(n)$, this operation can be performed in $\mathcal{O}(\log(n))$
time, using a dedicated AVL tree to store the children of a node. The total
time complexity of a single top-down descent is therefore in $\Theta(n\log(n))$
basic instructions.

• Then, a unique new weight $\pi_{\mathcal{F}}(\omega_d)$ is created and attached to the nodes
in the proper suffix of the parse walk (Blue nodes). A bottom-up stage
propagates the weight of the generated (mature) word to its ancestors,
all the way up to the root. The weights associated with branching nodes
along the path are incremented by $\pi_{\mathcal{F}}(\omega_d)$. Since $\Theta(n)$ nodes can be found
from the leaf to the initial immature word $S_n$, the complexity of this
stage is at most in $\mathcal{O}(n)$ arithmetic operations.

Note that the immature words used to label each node do not require an
explicit encoding (which may otherwise result in an $\mathcal{O}(n^2)$ time complexity).
Indeed, the immature words found on consecutive nodes may only differ on at
most two positions, owing to the binary nature of products. Therefore, one
may only store the difference between consecutive immature words, leading to a
space complexity in $\mathcal{O}(|\mathcal{F}| \cdot n)$ bits. By the same token, the memory requirement
can be limited to $2 \cdot |\mathcal{F}|$ large numbers, by observing that a unary node and its
unique successor have the same value for $\pi_{\mathcal{F}}(\cdot)$, and that the memory representation
of this number can be shared. □
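
The weighted tree of forbidden walks can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for Algorithm 3: immature words are represented by arbitrary hashable labels, children are stored in a hash map rather than an AVL tree, each parse walk is assumed to be inserted only once, and the names `ForbiddenTree`, `insert` and `forbidden_weight` are hypothetical.

```python
class _Node:
    def __init__(self):
        self.weight = 0      # total weight of forbidden words reachable from this node
        self.children = {}   # label of an immature word -> child node


class ForbiddenTree:
    """Prefix tree over parse walks, decorated with forbidden weights."""

    def __init__(self):
        self.root = _Node()  # stands for the initial immature word

    def insert(self, walk, weight):
        """Add the parse walk of a newly generated mature word of the given weight."""
        path = [self.root]
        node = self.root
        for label in walk:                        # top-down stage: create absent nodes
            node = node.children.setdefault(label, _Node())
            path.append(node)
        for node in reversed(path):               # bottom-up stage: propagate the weight
            node.weight += weight

    def forbidden_weight(self, walk_prefix):
        """Forbidden weight attached to the node reached by a prefix of a parse walk."""
        node = self.root
        for label in walk_prefix:
            node = node.children.get(label)
            if node is None:
                return 0                          # no forbidden word below this point
        return node.weight
```

Each insertion touches one node per derivation of the walk, which mirrors the linear number of weight updates discussed in the proof.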

4.5.3. Boustrophedon order for product non-terminals

For product-type non-terminal rules, one may possibly have to investigate
$\mathcal{O}(n)$ possible candidate partitions of the length, leading to a worst-case
complexity in $\mathcal{O}(n^2)$ arithmetic operations. Therefore, we use a Boustrophedon
order $(1, n-1, 2, n-2, \ldots)$ to investigate possible decompositions
$N_m \Rightarrow N'_i.N''_{m-i}$. As previously shown [16], this simple device reduces the total
number of executions of the body of the innermost loop of Algorithm 3 to
$\mathcal{O}(n \log(n))$ in the worst-case scenario.
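
The order of investigation itself is easy to produce. The generator below is a small sketch of it; the name `boustrophedon` is an assumption made for illustration.

```python
def boustrophedon(m):
    """Yield the split positions of [1, m-1] in the order 1, m-1, 2, m-2, ..."""
    lo, hi = 1, m - 1
    while lo <= hi:
        yield lo
        if hi != lo:      # avoid yielding the middle position twice
            yield hi
        lo += 1
        hi -= 1
```

For instance, `list(boustrophedon(6))` evaluates to `[1, 5, 2, 4, 3]`: the most unbalanced decompositions are tried first, so that the cost of finding a split at position $i$ is proportional to $\min(i, m-i)$.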

4.5.4. Arbitrary precision arithmetics

Although efficient algebraic generators exist even for some classes of
transcendent probabilities [15], it is reasonable, for all practical purposes, to assume
that weights are provided as floating point numbers of bounded (yet arbitrarily
large) precision. Since the language is context-free, the numbers involved in
the precomputation of the weights $\pi(N_i)$ and in the tree of forbidden words scale like
$\mathcal{O}(\alpha^n)$ for some explicit $\alpha$. It follows that operations performed on such numbers may
take time $\mathcal{O}(n \log(n) \log\log(n))$ [22], while the space occupied by their encoding
grows like $\mathcal{O}(n)$.

5. Non-redundant unranking algorithm

As an alternative approach, let us propose a weighted unranking algorithm,
which consists of two distinct parts:

• An unranking algorithm for generating words from a weighted context-free
grammar, presented in Section 5.1.

• An algorithm that samples random numbers uniformly within a gapped
union of intervals, to be used in the unranking algorithm to ensure
non-redundant generation, presented in Section 5.2.

Our main result is summarized by the following theorem.

Theorem 5.1. Using an unranking approach, $k$ distinct words of length $n$ can
be generated from a weighted grammar in $\mathcal{O}(n \cdot |\mathcal{N}| + k \cdot n\log n)$ arithmetic
operations, while storing $\mathcal{O}(n \cdot |\mathcal{N}| + k)$ large numbers.

Proof. In Section 5.1.4 we introduce Algorithm 4, a general unranking procedure
which transforms, in $\mathcal{O}(n\log(n))$ arithmetic operations, any random number
drawn uniformly in the interval $[0, \pi(\mathcal{L}(\mathcal{G}_\pi)_n)[$ into a random word in $\mathcal{L}(\mathcal{G}_\pi)_n$
with respect to a weighted distribution. Furthermore, Section 5.2 introduces a
dedicated data structure, coupled with Algorithm 5, which draws numbers in
the subset of $[0, \pi(\mathcal{L}(\mathcal{G}_\pi)_n)[$ that avoids the contributions of forbidden words, and
uses $\mathcal{O}(k\log(k))$ arithmetic operations.

The precomputation required by Algorithm 4 involves $\mathcal{O}(n \cdot |\mathcal{N}|)$ arithmetic
operations, and a storage of $\mathcal{O}(n \cdot |\mathcal{N}|)$ numbers. Maintaining the data
structure used by Algorithm 5 requires the storage of $\Theta(k)$ numbers. □

5.1. Weighted Unranking algorithm 

Unranking algorithms, formalized by Wilf [3], usually take as input a rank
in the interval $[0, |\mathcal{L}|)$, for $|\mathcal{L}|$ the number of words in a language, and output
a word from the language that is uniquely associated to this rank according to
some predefined ordering. It follows that calling an unranking procedure,
starting from a uniformly generated rank, immediately gives a uniformly generated
random object.

Generic unranking algorithms have been proposed for the uniform generation
of words from a context-free language [19]. Through grammar transformations
aiming at the introduction of a controlled ambiguity, Weinberg and Nebel [23]
extended their construct to special cases of non-uniform generation. For the sake of
self-completeness, we reformulate, and mildly generalize, the above algorithms.

5.1.1. Statement of the problem 

For a given length $n$, let us assume a total ordering on the words in $\mathcal{L}(S)_n$,
and denote by $w_1, \ldots, w_{|\mathcal{L}(S)|}$ the ordered list of words in $\mathcal{L}(S)_n$. One can then
split the interval $[0, \pi(\mathcal{L}(S))[$ into $|\mathcal{L}(S)|$ pieces of width $\pi(w_1), \pi(w_2), \ldots, \pi(w_{|\mathcal{L}(S)|})$
respectively, each piece being associated to a particular word. Denoting the $j$-th
interval by $I_j$, one has

$$I_j = \left[\sum_{k=1}^{j-1} \pi(w_k),\ \sum_{k=1}^{j} \pi(w_k)\right[.$$

The goal of our generalized unranking is to take as input a number
$r \in [0, \pi(\mathcal{L}(S))[$, to figure out the interval $I_j = [L_j, R_j[$ such that $L_j \le r < R_j$, and
to return the corresponding word $w_j$. Upon starting the unranking procedure
from a uniformly generated random real number in $[0, \pi(\mathcal{L}(S))[$, this word is
selected with probability proportional to the width of its interval, i.e. its
weight. It follows that the whole procedure constitutes a random generation
algorithm for the weighted probability distribution presented in Equation [2].
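
When the language is small enough to be listed explicitly, the statement above amounts to a search in a table of cumulative weights. The following sketch only illustrates the specification (the grammar-based procedure avoids enumerating the words); the function name and the example weights are assumptions.

```python
from bisect import bisect_right
from itertools import accumulate


def unrank_explicit(words, weights, r):
    """Return (w_j, (L_j, R_j)) such that L_j <= r < R_j, for an explicit word list."""
    bounds = list(accumulate(weights))           # bounds[j] is the upper bound R_j
    if not 0 <= r < bounds[-1]:
        raise ValueError("r must lie in [0, total weight[")
    j = bisect_right(bounds, r)                  # first interval whose upper bound exceeds r
    return words[j], (bounds[j] - weights[j], bounds[j])
```

With the words `["aa", "ab", "ba", "bb"]` and the hypothetical weights `[4, 2, 2, 1]`, the rank `r = 4` falls in the interval `(4, 6)` of the word `"ab"`.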

5.1.2. Total ordering for words of length n 

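For each non-terminal $N \in \mathcal{N}$, let us introduce a dedicated order relation
$\cdot \preceq_N \cdot$, defining an arbitrary notion of precedence on $\mathcal{L}(N)_{m \le n}$, the set of words
of length $m$ generated from $N$. For the sake of simplicity, let us write $A \preceq_N B$
as a shorthand for $a \preceq_N b$, $\forall (a, b) \in A \times B$. The order relation $\cdot \preceq_N \cdot$ is defined
by $w \preceq_N w$, $\forall w \in \mathcal{L}(N)_{m \le n}$, and recursively defined by:

• Union type $N \to N' \mid N''$. Then, $\forall m \le n$, one has:

- $\mathcal{L}(N'_m) \preceq_N \mathcal{L}(N''_m)$;

- $\forall w_1, w_2 \in \mathcal{L}(N'_m)$ (resp. $\mathcal{L}(N''_m)$), $w_1 \preceq_N w_2$ iff $w_1 \preceq_{N'} w_2$ (resp.
$w_1 \preceq_{N''} w_2$).

• Product type $N \to N'.N''$. Then, $\forall m \le n$, $\forall i, j \in [1, m-1]$, one has
(consistently with the ranking formula of Section 5.1.3):

- $\mathcal{L}(N'_i.N''_{m-i}) \preceq_N \mathcal{L}(N'_j.N''_{m-j})$ whenever $i < j$;

- $\forall w'.w'', v'.v'' \in \mathcal{L}(N'_i.N''_{m-i})$, $w'.w'' \preceq_N v'.v''$ iff either $w' \ne v'$ and
$w' \preceq_{N'} v'$, or $w' = v'$ and $w'' \preceq_{N''} v''$.

• Terminal type $N \to t$: $\mathcal{L}(N_n) = \{t\}$, and one has $t \preceq_N t$.

Let us then denote by $\cdot \preceq \cdot := \cdot \preceq_S \cdot$ the order induced on the language
generated by the axiom $S$ of the grammar. It is easily verified that $\cdot \preceq \cdot$
constitutes a total order over $\mathcal{L}(S_n)$.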

5.1.3. Ranking algorithm 

Let $x \in \mathbb{R}^+$ be a positive real number, and $I = [L, R[ \subset \mathbb{R}$ an interval, let
us overload the sum operator through $I + x := [L + x, R + x[$ for the sake of
simplicity. Then an algorithm Rank for computing the ranking interval of any
word $w \in \mathcal{L}(N)_n$ can be outlined as:

• Union type $N \to N' \mid N''$: if $w \in \mathcal{L}(N'_n)$ then return Rank($w$, $N'_n$).
Otherwise $w \in \mathcal{L}(N''_n)$, and return $\pi(N'_n)$ + Rank($w$, $N''_n$).

• Product type $N \to N'.N''$: Since the grammar is unambiguous,
there exists only one decomposition $w = w'.w''$ such that $w' \in \mathcal{L}(N')$ and
$w'' \in \mathcal{L}(N'')$. Let us then define

$[L', R'[ :=$ Rank($w'$, $N'_{|w'|}$) and $[L'', R''[ :=$ Rank($w''$, $N''_{|w''|}$).

As illustrated by Figure 4, the returned interval must then be

$$[L, R[ := \left[\sum_{i=1}^{|w'|-1} \pi(N'_i.N''_{n-i}) + L' \cdot \pi(N''_{n-|w'|}) + L'' \cdot \pi(w'),\ L + \pi(w') \cdot \pi(w'')\right[$$

• Terminal type $N \to t$: Return $[0, \pi(t)[$.

Figure 4: Water filling illustration of the ranking/unranking principle for the $\cdot \preceq \cdot$ order
in product type non-terminals. Each word $w'.w'' \in \mathcal{L}(N'_i.N''_{n-i})$ is uniquely associated with
a rectangular compartment of total area $\pi(w') \cdot \pi(w'')$. The ranking of a word $w = w'.w''$
can be adequately compared to the interval on the volume of water (in blue), which upon
injection in the matrix, partly fills the compartment associated with $w$, assuming a water flow
in a left-to-right/top-to-bottom lexicographic order. The unranking stage simply consists in
searching for the compartment which is partly filled upon injection of a given volume $r$.

5.1.4. Unranking algorithm

Let us now turn to Algorithm 4, which implements unranking for the
relation $\cdot \preceq \cdot$ and mostly consists in inverting the calculation presented in
Section 5.1.3.

Proposition 5.2. Given a real number $r \in [0, \pi(\mathcal{L}(\mathcal{G})_n)[$, Algorithm 4 produces
the word associated with an interval $I$, $r \in I$, in $\mathcal{O}(n\log(n))$ arithmetic
operations after a precomputation in $\Theta(|\mathcal{N}| \cdot n)$ arithmetic operations involving
storage of $\Theta(|\mathcal{N}| \cdot n)$ numbers.

Algorithm 4 Unranking algorithm. Returns a word $w$ and an interval $[I_L, I_R[$

Unrank($N_m$, $r$):
    if $N \to N' \mid N''$ then {Union type}
      if $r < \pi(N'_m)$ then
        return Unrank($N'_m$, $r$)
      else
 5:     ($w''$, $[I_L, I_R[$) = Unrank($N''_m$, $r - \pi(N'_m)$)
        return ($w''$, $[I_L + \pi(N'_m), I_R + \pi(N'_m)[$)
      end if
    else if $N \to N'.N''$ then {Product type}
      $L \leftarrow 0$
10:   for all $i \in [1, m-1]$ do
        if $\pi(N'_i) \cdot \pi(N''_{m-i}) \le r$ then
          $r \leftarrow r - \pi(N'_i) \cdot \pi(N''_{m-i})$
          $L \leftarrow L + \pi(N'_i) \cdot \pi(N''_{m-i})$
        else {Found the right decomposition}
15:       ($w'$, $[L', R'[$) = Unrank($N'_i$, $r / \pi(N''_{m-i})$)
          ($w''$, $[L'', R''[$) = Unrank($N''_{m-i}$, $(r - L' \cdot \pi(N''_{m-i})) / \pi(w')$)
          $I_L = L + L' \cdot \pi(N''_{m-i}) + L'' \cdot \pi(w')$
          $I_R = I_L + \pi(w') \cdot \pi(w'')$
          return ($w'.w''$, $[I_L, I_R[$)
20:     end if
      end for
    else if $N \to t$ then {Terminal type}
      return ($t$, $[0, \pi(t)[$)
    end if



Sketch of proof. First let us outline a proof of correctness by induction for the
unranking procedure, starting from the initial case of terminal rules, where the
algorithm returns the only word $t$, associated with an interval $[0, \pi(t)[$.

In the case of union rules, one either needs to remove the added contribution
$\pi(N'_m)$ when $r \ge \pi(N'_m)$ before proceeding to unrank within $\mathcal{L}(N''_m)$, or directly
unrank within $\mathcal{L}(N'_m)$ otherwise.

For product rules, one first remarks that $\sum_{l=1}^{i-1} \pi(N'_l.N''_{n-l})$ is exactly the
quantity computed within $L$ in Section 5.1.3, so one is left to ensure that the
remaining part of $r$ indeed generates its corresponding word. Namely, let us
assume that $w = w'.w'' \in \mathcal{L}(N'_i.N''_{n-i})$, where $w'$ and $w''$ are associated with
intervals $[L', L' + \pi(w')[$ in $\mathcal{L}(N'_i)$ and $[L'', L'' + \pi(w'')[$ in $\mathcal{L}(N''_{n-i})$ respectively.
Therefore the interval associated with $w$ (after subtraction of $L$) is
$I = [x, x + \pi(w') \cdot \pi(w'')[$ with $x := L' \cdot \pi(N''_{n-i}) + L'' \cdot \pi(w')$. Therefore computing, as done
by Algorithm 4, the quantity $r' := r/\pi(N''_{n-i})$ for any $r \in I$ gives

$$L' + \frac{L'' \cdot \pi(w')}{\pi(N''_{n-i})} \le r' < L' + \frac{\pi(w') \cdot (L'' + \pi(w''))}{\pi(N''_{n-i})}.$$

Since $L''$ is a partial sum of the weights in $\mathcal{L}(N''_{n-i})$, one has
$0 \le L'' \le \pi(N''_{n-i}) - \pi(w_{|\mathcal{L}(N''_{n-i})|})$ and both bounds are tight (reached by the first and last words).
It follows that

$$L' \le r' < L' + \pi(w')$$

in which one recognizes the interval associated with $w'$ within $\mathcal{L}(N'_i)$. The
recursive unranking on $\mathcal{L}(N''_{n-i})$ is given as argument $r'' := (r - L' \cdot \pi(N''_{n-i}))/\pi(w')$ which,
for $r \in I$, gives

$$L'' \le r'' < L'' + \pi(w'')$$

in which one recognizes the interval associated with $w''$ within $\mathcal{L}(N''_{n-i})$. We
conclude on the correctness of the algorithm by reminding that the unambiguity
of the grammar prevents multiple parsings (i.e. different intervals) from contributing
to the generation of a given word.

The complexity of the algorithm is established by the following observations:

• The numbers $\pi(N_m)$ involved in the unranking procedure can be
precomputed thanks to the existence of linear recurrences for the coefficients of
holonomic generating functions, as discussed in Section 4.5.1. They can
then be precomputed in $\mathcal{O}(n)$ arithmetic operations, requiring storage for
$|\mathcal{N}| \cdot \Theta(n)$ large numbers.

• The order of investigation of possible decompositions can be modified
in Algorithm 4, line 10, to adopt a Boustrophedon order as discussed in
Section 4.5.3, decreasing the worst-case complexity of the algorithm from
$\mathcal{O}(n^2)$ to $\mathcal{O}(n\log(n))$. The total ordering on words can
then be redefined to account for such a change, and the proof of correctness
is easily adapted.

□ 
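
The unranking procedure can be tried out on a small example. The sketch below is an illustration under several assumptions that are not part of the paper: a hypothetical BCNF grammar generating all non-empty words over {a, b}, hypothetical letter weights, exact rational arithmetic through `fractions.Fraction`, a memoized quadratic-time weight computation instead of the linear recurrences, and the natural order on split positions rather than the Boustrophedon order.

```python
from fractions import Fraction
from functools import lru_cache

# Hypothetical grammar: S -> X | P,  P -> X.S,  X -> A | B,  A -> a,  B -> b
RULES = {
    "S": ("union", "X", "P"),
    "P": ("prod", "X", "S"),
    "X": ("union", "A", "B"),
    "A": ("term", "a"),
    "B": ("term", "b"),
}
LETTER_WEIGHT = {"a": Fraction(2), "b": Fraction(1)}  # assumed weights


@lru_cache(maxsize=None)
def weight(N, m):
    """Total weight of the words of length m generated from non-terminal N."""
    kind, *args = RULES[N]
    if kind == "term":
        return LETTER_WEIGHT[args[0]] if m == 1 else Fraction(0)
    if kind == "union":
        return weight(args[0], m) + weight(args[1], m)
    return sum((weight(args[0], i) * weight(args[1], m - i) for i in range(1, m)),
               Fraction(0))


def word_weight(w):
    """Weight of a mature word: product of the weights of its letters."""
    p = Fraction(1)
    for letter in w:
        p *= LETTER_WEIGHT[letter]
    return p


def unrank(N, m, r):
    """Return (w, lo, hi): the word of length m whose interval [lo, hi[ contains r."""
    kind, *args = RULES[N]
    if kind == "term":
        return args[0], Fraction(0), LETTER_WEIGHT[args[0]]
    if kind == "union":
        left, right = args
        offset = weight(left, m)
        if r < offset:
            return unrank(left, m, r)
        w, lo, hi = unrank(right, m, r - offset)
        return w, lo + offset, hi + offset
    left, right = args
    base = Fraction(0)                      # plays the role of L in the pseudocode
    for i in range(1, m):
        right_total = weight(right, m - i)
        block = weight(left, i) * right_total
        if block <= r:                      # r lies beyond this decomposition
            r -= block
            base += block
            continue
        w1, lo1, _ = unrank(left, i, r / right_total)
        w2, lo2, _ = unrank(right, m - i, (r - lo1 * right_total) / word_weight(w1))
        lo = base + lo1 * right_total + lo2 * word_weight(w1)
        return w1 + w2, lo, lo + word_weight(w1) * word_weight(w2)
    raise ValueError("rank out of range")
```

Under these assumptions, `unrank("S", 2, Fraction(5))` returns the word `"ab"` together with the bounds 4 and 6, and the widths of the returned intervals always equal the weights of the returned words.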

5.2. Random generation of numbers in gapped intervals 

In the previous section, a simple weighted unranking algorithm was proposed.
Therefore, by generating a random number $r$ uniformly in $[0, \pi(\mathcal{L})[$ and
using the Unranking algorithm, a word $w$ can be generated with respect to
the weighted distribution over a language $\mathcal{L}$. However, when a forbidden set $\mathcal{F}$ is
given, one additionally needs to avoid any interval associated with a forbidden
word. In other words, one can no longer draw a random number uniformly in
$[0, \pi(\mathcal{L})[$, but rather in

$$I_{\overline{\mathcal{F}}} := \bigcup_{w \notin \mathcal{F}} I_w = [0, \pi(\mathcal{L})[ \, \setminus \Big(\bigcup_{w \in \mathcal{F}} I_w\Big).$$

Since the intervals $I_w$ are mutually disjoint subsets of $[0, \pi(\mathcal{L})[$, a possible
strategy consists in drawing a random number $r \in [0, \pi(\mathcal{L}) - \pi(\mathcal{F})[$, and
incrementing $r$ by some quantity $\delta_{r,\mathcal{F}}$ that sends $r + \delta_{r,\mathcal{F}}$ into $I_{\overline{\mathcal{F}}}$. Considering Figure 5,
one observes that $\delta_{r,\mathcal{F}}$ can be inductively defined as the total weight of all
forbidden words smaller than the word found at $r + \delta_{r,\mathcal{F}}$. In general, one could
order the forbidden words in $\mathcal{F}$ and traverse $\mathcal{F}$ to compute $\delta_{r,\mathcal{F}}$, but this would
induce spending an additional $\mathcal{O}(|\mathcal{F}|)$ arithmetic operations per generation. For
this reason, the intervals of forbidden words are gathered in a balanced binary
tree structure that grants access to $\delta_{r,\mathcal{F}}$ in $\mathcal{O}(\log(|\mathcal{F}|))$ operations.

Figure 5: Illustration of the shift function $\delta$. In order to avoid any forbidden words, one needs
to shift rightward a random number $r$ by the total weight of forbidden words (red area) that
are found leftward.

Algorithm 5 ModRandom: Takes a uniform random number and a node,
and returns a uniform random number that avoids any interval associated with
already generated words.

ModRandom($r$, $v$)
    if $v = \emptyset$ then
      return $r$
    end if
    $(T_L, T_R, w, [L_w, R_w[, \mu_w) \leftarrow v$
 5: if $r < L_w - \mu_w$ then
      return ModRandom($r$, $T_L$)
    else
      return ModRandom($r + \mu_w + \pi(w)$, $T_R$)
    end if
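
Before turning to the balanced tree, the overall scheme can be sketched with the naive linear traversal mentioned above. The code below is an illustration only, under assumptions that are not part of the paper: the language is given as an explicit list of words with integer weights (so that ranks can be drawn exactly with `randrange`), forbidden intervals are kept in a sorted Python list, and the function names are hypothetical.

```python
import random
from bisect import bisect_right, insort
from itertools import accumulate


def shift_naive(r, forbidden):
    """Shift r rightward by the width of every forbidden interval found leftward.

    `forbidden` is a sorted list of disjoint intervals (lo, hi), read as [lo, hi[.
    """
    for lo, hi in forbidden:
        if r < lo:          # r lands strictly before this forbidden interval
            break
        r += hi - lo        # skip over the forbidden interval
    return r


def sample_distinct(words, weights, k, rng):
    """Draw k distinct words (k <= len(words)), each draw weighted among the remaining ones."""
    bounds = list(accumulate(weights))
    forbidden, forbidden_mass, out = [], 0, []
    for _ in range(k):
        r = rng.randrange(bounds[-1] - forbidden_mass)  # uniform over the allowed mass
        x = shift_naive(r, forbidden)                   # avoids every forbidden interval
        j = bisect_right(bounds, x)                     # explicit-table unranking
        out.append(words[j])
        insort(forbidden, (bounds[j] - weights[j], bounds[j]))
        forbidden_mass += weights[j]
    return out
```

Each call to `shift_naive` costs a number of operations linear in the number of forbidden words, which is precisely the overhead that the tree structure of the next section removes.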

5.2.1. AVL tree for forbidden intervals 

For each (word, interval) pair produced by the unrank algorithm, a
corresponding node is inserted into an AVL tree [1], i.e. a self-balancing binary search
tree, whose height after $k$ insertions can be limited to $\mathcal{O}(\log(k))$ through
balancing operations. Since the intervals associated with the forbidden set are
non-overlapping, they can be compared and therefore stored within an AVL
tree. It follows that the insertion and lookup of $k$ intervals can be performed in
$\Theta(k\log(k))$ comparisons in the worst-case scenario.

Let us then define recursively our tree as either the empty tree, denoted by
$\emptyset$, or a 5-tuple $v = (T_L, T_R, w, I_w, \mu_w)$ where:

• $T_L$ and $T_R$ are respectively the left and right children of the tree. Both
can possibly be empty trees.

• $w$ and $I_w := [L_w, R_w[$ are a forbidden word and its corresponding interval.

• $\mu_w$ is the total weight of forbidden intervals in the left subtree.

Let us remind that the nodes of an AVL tree are such that any node in a left
subtree is less than or equal to its root, itself being less than or equal to any
node of its right subtree. Also let us remark that, upon inserting in a tree $v$ a
new word $w' \ne w$ associated with an interval $I_{w'} = [L_{w'}, R_{w'}[$, the value $\mu_w$,
initialized at 0, can be easily updated into a new value $\mu'_w$ such that

$$\mu'_w = \begin{cases} \mu_w + \pi(w') & \text{if } w' \prec w \text{, i.e. } w' \text{ is inserted in the left subtree } T_L \text{ of } v\\ \mu_w & \text{otherwise} \end{cases} \qquad (7)$$

Assuming the tree is correctly built, Algorithm 5 simply descends into the
tree, and computes $\delta_{r,\mathcal{F}}$ incrementally. For a given node $v = (T_L, T_R, w, I_w, \mu_w)$,
the algorithm determines if $r$ corresponds to a word in the interval covered by
$T_L$, by comparing $r$ to $L_w - \mu_w$, the total mass of allowed words in $T_L$. If
smaller, then $r$ remains unmodified and the algorithm is run recursively on $T_L$.
If greater, then the final interval reached by $r$ is greater than $I_w$ and fits in the
right subtree $T_R$. The value $r$ is then incremented by the total mass $\mu_w + \pi(w)$
of forbidden words smaller than $T_R$, and this value is used within a recursive call
on $T_R$. This process is terminated when the empty tree $\emptyset$ is reached, and the
current value of $r$ is returned. In other words, the returned value $r$ is distant
from its original value by the sum of weights $\mu_w$ on the left subtrees whose
intervals are dominated by $r$, in which one recognizes the definition of $\delta_{r,\mathcal{F}}$.

5.2.2. Correctness

Proposition 5.3. The function ModRandom computed by Algorithm 5 is a
bijection from $[0, \pi(\mathcal{L}) - \pi(\mathcal{F})[$ onto $[0, \pi(\mathcal{L})[ \, \setminus (\bigcup_{w \in \mathcal{F}} I_w)$ with uniform density.

Proof. The outline of the proof is as follows: First we establish a technical
invariant on the subset of values passed to Algorithm 5. Using this invariant,
we show that the final value returned by ModRandom avoids every forbidden
interval, and that any interval can be reached. Let us start with some notations,
followed by a technical lemma.

Let $v_i$ be the $i$-th node in the tree and let us denote by $[a, b]$ the
indices of nodes accessible from $v_i$. Then let us denote by $H_i$ the interval that
is dominated by $v_i$, defined as

$$H_i := \left[R_{w_{a-1}},\ L_{w_{b+1}} - \sum_{k=a}^{b} \pi(w_k)\right[$$

where $R_{w_i}$, $i \in [1, |\mathcal{F}|]$, the upper bound (resp. $L_{w_i}$, $i \in [1, |\mathcal{F}|]$, the lower bound)
of the forbidden interval of index $i$ is extended by $R_{w_0} = 0$ (resp. $L_{w_{|\mathcal{F}|+1}} = \pi(\mathcal{L})$).

Lemma 5.4. Let $v_i = (T_L, T_R, w_i, [L_{w_i}, R_{w_i}[, \mu_{w_i})$ be a node in the tree. Then
the set of values $r$ passed as argument to ModRandom jointly with $v_i$ is exactly
$H_i$.

Proof. Let us prove this claim by induction on the depth $D$ of recursive calls.
Clearly in the initial call ($D = 0$), $v_i$ is the root node and $H_i$ is the whole
interval $[0, \pi(\mathcal{L}) - \pi(\mathcal{F})[$ from which $r$ is drawn uniformly, so our claim holds.
Assume now that the set of possible values for $r$ is exactly
$H_i := [R_{w_{a-1}}, L_{w_{b+1}} - \sum_{k=a}^{b} \pi(w_k)[$ at a given depth $D = M$, then let us investigate the recursive calls.
Two cases arise, depending on the value of $r$:

• When $r \in A = [R_{w_{a-1}}, L_{w_i} - \mu_{w_i}[$, then ModRandom is called on $v_j := T_L$
with unmodified value $r' := r$. Thanks to the binary search tree structure,
the indices of the forbidden nodes on the left subtree are $[a, \ldots, i-1]$,
and $H_j = [R_{w_{a-1}}, L_{w_i} - \sum_{k=a}^{i-1} \pi(w_k)[$. Since $\mu_{w_i} = \sum_{k=a}^{i-1} \pi(w_k)$ (def.),
then $H_j = A$, and any value $r' \in H_j$ can therefore be passed to the
subsequent call.

• When $r \in B = [L_{w_i} - \mu_{w_i}, L_{w_{b+1}} - \sum_{k=a}^{b} \pi(w_k)[$, then ModRandom
is called on $v_j := T_R$ with value $r' := r + \mu_{w_i} + \pi(w_i)$. The indices
of the forbidden nodes on the right subtree are $[i+1, \ldots, b]$, so one has
$H_j = [R_{w_i}, L_{w_{b+1}} - \sum_{k=i+1}^{b} \pi(w_k)[$. The image $B'$ of the interval $B$ through
a shift of value $\mu_{w_i} + \pi(w_i)$ is then

$$B' = \left[L_{w_i} + \pi(w_i),\ L_{w_{b+1}} - \sum_{k=a}^{b} \pi(w_k) + \sum_{k=a}^{i-1} \pi(w_k) + \pi(w_i)\right[ = \left[R_{w_i},\ L_{w_{b+1}} - \sum_{k=i+1}^{b} \pi(w_k)\right[ = H_j.$$

Finally, since $r$ can be any value in $B$, then any value $r' \in H_j$ can be
passed to ModRandom for some value $r \in B$.

Consequently at depth $D = M+1$, the values $r'$ provided to ModRandom over
a subtree $v_j$ are exactly $H_j$, and this property therefore holds for any $D \ge 0$. □

Let us show that forbidden intervals are indeed avoided. Let us consider a
node $v_i = (T_L, T_R, w_i, [L_{w_i}, R_{w_i}[, \mu_{w_i})$, giving rise to a call ModRandom($r'$, $\emptyset$),
itself returning the final value. Since, for this node, Lemma 5.4 holds, then
the value passed to this call is any $r \in [R_{w_{i-1}}, L_{w_{i+1}} - \pi(w_i)[$. Therefore either
$r < L_{w_i}$ and $r \in [R_{w_{i-1}}, L_{w_i}[$ is returned, or $r \ge L_{w_i}$ and $r + \pi(w_i) \in [R_{w_i}, L_{w_{i+1}}[$
is returned. It follows that any returned value $r'$ falls between two consecutive
forbidden intervals (resp. within the ending intervals $[0, L_{w_1}[$ or $[R_{w_{|\mathcal{F}|}}, \pi(\mathcal{L})[$),
and therefore cannot fall in a forbidden interval.

Furthermore let us prove that any two calls ModRandom($r'$, $\emptyset$) and ModRandom($r''$, $\emptyset$)
from $v_i$ and $v_j$ respectively, $i \ne j$, give rise to distinct intervals. Recall that, as
pointed out in the previous paragraph, the possibly generated intervals from a
node $v_i$ are $[R_{w_{i-1}}, L_{w_i}[$ if $T_L = \emptyset$, and $[R_{w_i}, L_{w_{i+1}}[$ if $T_R = \emptyset$. Therefore, by
contradiction, any two calls giving rise to identical intervals would have to involve
consecutive nodes $v_i$ and $v_{i+1}$ such that the right subtree of $v_i$ is $T_{R,i} = \emptyset$ and the
left subtree of $v_{i+1}$ is $T_{L,i+1} = \emptyset$. Since such two nodes would represent
consecutive values, one would appear in a subtree of the other, otherwise the
first common ancestor $v_j$ of $v_i$ and $v_{i+1}$ would be such that $v_i < v_j < v_{i+1}$ and
the two nodes would not be consecutive. Since $v_i < v_{i+1}$, then either $v_i$ would
be found in the left subtree of $v_{i+1}$ (and then $T_{L,i+1} \ne \emptyset$), or $v_{i+1}$ would be
found in the right subtree of $v_i$ (and then $T_{R,i} \ne \emptyset$). Both situations contradict
the premises, thus any interval $[R_{w_{i-1}}, L_{w_i}[$, $i \in [1, |\mathcal{F}| + 1]$, is generated by at
most a call over a single empty tree node $\emptyset$.

We conclude with the remark that there are exactly $|\mathcal{F}| + 1$ leaves in a binary
tree with $|\mathcal{F}|$ inner nodes. Since there are also $|\mathcal{F}| + 1$ intervals $[R_{w_{i-1}}, L_{w_i}[$,
$i \in [1, |\mathcal{F}| + 1]$, which are generated by at most one leaf, then any such interval is
generated, and ModRandom is therefore a bijection of $[0, \pi(\mathcal{L}) - \pi(\mathcal{F})[$ into
$[0, \pi(\mathcal{L})[ \, \setminus (\bigcup_{w \in \mathcal{F}} I_w)$.

Finally, since the map ModRandom involves only shifts and no scaling, it
follows that the map is measure preserving. Thus the algorithm alters uniformly
generated random numbers over $[0, \pi(\mathcal{L}) - \pi(\mathcal{F})[$ into uniform random numbers
over $[0, \pi(\mathcal{L})[ \, \setminus (\bigcup_{i=1}^{|\mathcal{F}|} I_{w_i})$. □
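
The behaviour established by the proposition can be observed on a small instance. The sketch below is a simplified illustration rather than the structure analysed in this section: it uses a plain (unbalanced) binary search tree instead of an AVL tree, stores only the bounds of each forbidden interval, and uses hypothetical names. The update of `mu` upon insertion follows the rule stated for the left-subtree weights.

```python
class Node:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi     # forbidden interval [lo, hi[
        self.left = None
        self.right = None
        self.mu = 0                   # total width of forbidden intervals in the left subtree


def insert(root, lo, hi):
    """Insert a forbidden interval (disjoint from the stored ones); return the root."""
    if root is None:
        return Node(lo, hi)
    node = root
    while True:
        if lo < node.lo:
            node.mu += hi - lo        # the new interval lands in the left subtree
            if node.left is None:
                node.left = Node(lo, hi)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(lo, hi)
                return root
            node = node.right


def mod_random(r, node):
    """Iterative version of ModRandom: shift r so that it avoids forbidden intervals."""
    while node is not None:
        if r < node.lo - node.mu:
            node = node.left
        else:
            r += node.mu + (node.hi - node.lo)
            node = node.right
    return r
```

With the forbidden intervals [4, 6[, [0, 1[ and [8, 9[ inserted in this order in a range of total weight 9, the input 3 is mapped to 6, and every input in [0, 5[ is sent to [1, 4[ or [6, 8[.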

5.2.3. Complexity considerations 

As can be seen in Equation 7, updating the values $\mu_v$ in a tree with $m$
nodes can be done in $\mathcal{O}(\log(m))$ arithmetic operations upon insertion of a new
node. However the AVL structure also requires a post-processing consisting
of $\mathcal{O}(\log(m))$ shifts to keep the tree balanced. The shift operation involves
taking two nodes $v_i < v_j$ that are connected in the tree and switching their
ancestrality. Namely, if $v_i$ was the first node of the left subtree of $v_j$, then $v_j$
becomes the first node of the right subtree of $v_i$ (and vice-versa). The
effect of this operation is local, therefore in any pair $(v_i, v_j)$ of nodes inverted
by a shift operation, the values $\mu_{v_i}$ and $\mu_{v_j}$ can be updated in $\mathcal{O}(1)$ arithmetic
operations, and the overall cost of $k$ insertions remains in $\mathcal{O}(k\log(k))$ arithmetic
operations.

Each internal node maintains a possibly large number $\mu$, therefore $\Theta(|\mathcal{F}|)$
numbers need to be stored in the tree. The ratio of probability between the most
and least probable structure grows like $\Theta(\alpha^n)$, therefore at least $\Theta(n)$ bits need
to be used for the numbers.
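
The constant-time update of the left-subtree weights under a shift (rotation) can be made explicit. The sketch below is an assumption-laden illustration: node fields and function names are hypothetical, and the helper `mass` recomputes subtree weights from scratch only to initialize the example and to check the result.

```python
class Node:
    def __init__(self, lo, hi, left=None, right=None):
        self.lo, self.hi = lo, hi
        self.left, self.right = left, right
        self.mu = mass(left)          # total width of forbidden intervals on the left


def mass(node):
    """Total width of the intervals stored in a subtree (linear time, for checking only)."""
    if node is None:
        return 0
    return mass(node.left) + (node.hi - node.lo) + mass(node.right)


def rotate_right(y):
    """The left child x of y becomes the parent of y; returns the new subtree root."""
    x = y.left
    y.left, x.right = x.right, y
    y.mu -= x.mu + (x.hi - x.lo)      # y loses x and the left subtree of x
    return x


def rotate_left(x):
    """The right child y of x becomes the parent of x; returns the new subtree root."""
    y = x.right
    x.right, y.left = y.left, x
    y.mu += x.mu + (x.hi - x.lo)      # y gains x and the left subtree of x
    return y
```

In both rotations a single value of `mu` changes, and the new value is obtained from two stored numbers, which is consistent with the constant number of arithmetic operations per shift claimed above.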

6. Conclusion and perspectives 

We addressed the random generation of non-redundant sets of sequences
from context-free languages, while avoiding a predefined set of words. We first
investigated the efficiency of a rejection approach. Such an approach was found
to be acceptable in the uniform case. By contrast, for weighted languages,
we showed that for some languages the expected number of rejections would
grow exponentially in the desired number of generated sequences for at least
two parameters. Furthermore, we showed that in typical context-free languages
and for fixed length, the probability distribution can be dominated by a small
number of sequences. We proposed a first algorithm for this problem, based
on the recursive approach. The correctness of the algorithm was demonstrated,
and its efficient implementation discussed. This algorithm was shown to perform
a non-redundant generation of $k$ distinct structures in $\mathcal{O}(k \cdot n\log(n))$, after
a precomputation in $\Theta(n \cdot |\mathcal{N}|)$ arithmetic operations, and requires storage of
$\Theta(n \cdot |\mathcal{N}| + k)$ large numbers, and a data structure consisting of $\Theta(n \cdot k)$ nodes.
We explored a second approach, based on ranking/unranking, for
the same task, and obtained an algorithm in $\mathcal{O}(n \cdot |\mathcal{N}| + k \cdot n\log n)$ complexity,
with the slightly decreased memory consumption of $\Theta(n \cdot |\mathcal{N}| + k)$ large numbers.
These complexities hold in the worst-case scenario, and remain mostly
unaffected by the magnitude of weights being used.

6.1. Different impact of fixed-precision arithmetics implementations 

When using arbitrary (or sufficient) precision arithmetics, the complexity
and storage of the two algorithms are the same. However, practical
implementations may involve using fixed-precision arithmetic, in which case significant
differences between the two methods arise. The complexity of both algorithms
can be improved significantly if one uses fixed-precision arithmetic. However,
in both cases, the algorithms suffer from a quantifiable loss of precision.

If the ratio between the weight of the smallest word and the weight of the
language is small, then built-in floating point operations may be used, giving some
advantages to the unranking approach with respect to its memory consumption.
Indeed, the cost of storing the data structure will then dominate the memory
consumption of the recursive version ($\Theta(n \cdot |\mathcal{N}| + n \cdot k)$), while the memory
complexity of the unranking algorithm gently decreases to $\Theta(n \cdot |\mathcal{N}| + k)$.

However, we believe the recursive method to be more stable numerically than
the unranking approach. Indeed, the weights accessible on the alternative choices
in the usual generation are typically comparable. Therefore, it will typically take
a large number of generations for the recursive algorithm to fully deplete one of
the alternatives. By contrast, the unranking algorithm may very quickly isolate
a poorly contributing set of words after very few generations. For instance, if
the second word in the ordering is generated first, then the data structure may
practically forbid the first element, choosing it with probability 0 because of the
rounding error. This point therefore seems favorable to the recursive algorithm.

6.2. Perspectives 

Let us briefly outline a few perspectives to the current work: 

• Decomposable structures: One natural extension of the current work 
concerns the random generation of the more general class of decomposable 
structures 16j. Indeed, such aspects like the pointing and unpointing oper- 
ator are not explicitly accounted for in the current work. Furthermore, the 
generation of labeled structures might be amenable to similar techniques 
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in order to avoid a redundant generation. It is unclear, however, how one 
may extend the notion of parse tree in this context. Intrinsic ambiguity 
issues might arise, for instance while using the unranking operator. 

• Non-redundant Boltzmann sampling. Another direction for an efficient
implementation of the non-redundant generation may rely on an
extension of Boltzmann samplers [12]. Indeed, the prefix tree introduced
by the step-by-step algorithm could, in principle, be used as is to correct
the probabilities used by Boltzmann sampling. However, it is unclear
how such a correction may impact the probability of rejection, and
consequently degrade the performance of the resulting algorithm.

• Accommodating general sets of forbidden words. Both the step-by-step
and unranking algorithms require the preliminary insertion of the
forbidden set $\mathcal{F}$ into a dedicated data structure (prefix tree/AVL tree),
both requiring the parse trees/walks of every word in $\mathcal{F}$ to be available.
When such information is not available, one could in principle parse the
words in $\mathcal{F}$ to build the tree. In general this may require $|\mathcal{F}|$ runs of an
$n^{3-\varepsilon}$ parsing algorithm, leading to an impractical
$\mathcal{O}(n^{3-\varepsilon} \cdot |\mathcal{F}|)$ complexity. In
practice, it seems more fruitful to simply run the algorithm starting from
an empty tree, and to test after each generation whether the generated word is
found in $\mathcal{F}$. If so, reject it after adding its parse walk, available to the
algorithm without further computation since the word was just created,
to the tree. Since this update is made at most once for each word in $\mathcal{F}$,
the worst-case complexity of generating $k$ words remains bounded by
$\mathcal{O}(|\mathcal{F}| \cdot n \log(n))$ arithmetic operations.
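
The lazy treatment of forbidden words described in the last perspective can be
sketched as follows. This is a minimal illustration under strong assumptions:
the language is the toy language of all binary words of length n with uniform
weights (not a general context-free language), the prefix tree is a plain
dictionary, and the names `PrefixTree`, `sample_new_word` and
`generate_avoiding` are hypothetical.

```python
import random

class PrefixTree:
    """Stores, for every prefix, how many already-drawn words extend it."""
    def __init__(self):
        self.drawn = {}  # prefix (str) -> number of drawn words below it

    def count(self, prefix):
        return self.drawn.get(prefix, 0)

    def insert(self, word):
        # Update every prefix of the word, including the word itself.
        for i in range(len(word) + 1):
            self.drawn[word[:i]] = self.count(word[:i]) + 1

def sample_new_word(n, tree, rng):
    """Uniformly draw a word of {0,1}^n that was never inserted in the tree."""
    prefix = ""
    for i in range(n):
        below = 2 ** (n - i - 1)                  # words under each child
        left = below - tree.count(prefix + "0")   # still available under '0'
        right = below - tree.count(prefix + "1")  # still available under '1'
        r = rng.randrange(left + right)
        prefix += "0" if r < left else "1"
    return prefix

def generate_avoiding(n, k, forbidden, seed=0):
    """Draw k distinct words outside `forbidden`, starting from an empty tree.

    Assumes every forbidden word is a binary word of length n.
    """
    if k + len(forbidden) > 2 ** n:
        raise ValueError("not enough admissible words")
    rng = random.Random(seed)
    tree, words, rejections = PrefixTree(), [], 0
    while len(words) < k:
        w = sample_new_word(n, tree, rng)
        tree.insert(w)            # the walk of w is known without any parsing
        if w in forbidden:
            rejections += 1       # happens at most once per forbidden word
        else:
            words.append(w)
    return words, rejections

forbidden = {"0000", "1111", "0101"}
words, rejections = generate_avoiding(4, 13, forbidden)
print(words, rejections)
```

Each forbidden word can be drawn, and therefore rejected, at most once, since
it is inserted into the tree as soon as it is first generated.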

Appendix A. Expressivity of the binary Chomsky normal form

Let us show that the assumption of a BCNF can be made without loss of 
generality (or performance). Indeed, it is a classic result that any context-free 
grammar $\mathcal{G}$ can be transformed into a Chomsky Normal Form (CNF) grammar
that generates the same language. 

Appendix A.1. From CNF to BCNF grammars: An algorithm

From such a grammar, an equivalent grammar in BCNF can be simply and 
efficiently obtained through the following transformation: 

i) For each terminal $t$ (resp. empty word $\varepsilon$) create a new non-terminal $N_t$
(resp. $N_\varepsilon$) whose sole production is $N_t \to t$ (resp. $N_\varepsilon \to \varepsilon$);

ii) Replace any occurrence of $t$ (resp. $\varepsilon$) within a production rule with its
dedicated non-terminal $N_t$ (resp. $N_\varepsilon$);

iii) Replace any rule $N \to N'.N''$, where $N$ has more than one derivation,
with rules $N \to N^*$ and $N^* \to N'.N''$, where $N^*$ is a newly created
non-terminal;

iv) For any non-terminal $N$ having multiple production rules
($N \to X_1 \mid \cdots \mid X_k$, $k > 1$), create $k - 2$ dedicated non-terminals
$\{N_i\}_{i=1}^{k-2}$, and replace the rules of $N$ with a tree-like equivalent
hierarchy of binary rules. For instance, one may create chained rules, such that
$N \to X_1 \mid N_1$, $\{N_i \to X_{i+1} \mid N_{i+1}\}_{i=1}^{k-3}$,
and $N_{k-2} \to X_{k-1} \mid X_k$;

v) Finally, remove every non-terminal whose sole production is $N \to N'$,
replacing any occurrence of $N$ by $N'$ in any derivation rule.
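
The chained construction suggested in step iv) can be sketched as below. This
is only an illustration: the function name `binarize_alternatives`, the
dictionary encoding of rules, and the naming scheme `N_1, N_2, ...` for the
fresh non-terminals (assumed not to clash with existing symbols) are
hypothetical choices, not taken from the paper.

```python
def binarize_alternatives(name, alternatives):
    """Rewrite name -> X_1 | ... | X_k as a chain of binary unions.

    Returns a dict mapping each non-terminal to its (at most two) alternatives.
    """
    k = len(alternatives)
    if k <= 2:
        # Already binary (or unary): nothing to create.
        return {name: list(alternatives)}

    # k - 2 fresh non-terminals N_1 .. N_{k-2} (hypothetical naming scheme).
    fresh = [f"{name}_{i}" for i in range(1, k - 1)]

    rules = {name: [alternatives[0], fresh[0]]}          # N -> X_1 | N_1
    for i in range(1, k - 2):                            # i = 1 .. k-3
        rules[fresh[i - 1]] = [alternatives[i], fresh[i]]  # N_i -> X_{i+1} | N_{i+1}
    rules[fresh[-1]] = [alternatives[k - 2], alternatives[k - 1]]  # N_{k-2} -> X_{k-1} | X_k
    return rules

chain = binarize_alternatives("N", ["A", "B", "C", "D"])
for lhs, rhs in chain.items():
    print(lhs, "->", " | ".join(rhs))
```

A balanced tree of unions would be an equally valid "tree-like hierarchy"; the
chain is simply the variant spelled out in the text.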

Appendix A.2. Correctness

The equivalence of the resulting grammar to the input one in CNF trivially 
follows from the language-preserving nature of the substitutions performed at 
each step. Furthermore, it is easily verified that the resulting grammar is in 
BCNF. Indeed, consider the set $\mathcal{P}_N$ of derivation rules available for any former
non-terminal $N$, along the transformation:

• Before executing the transformation: $\mathcal{P}_N$ consists of an arbitrary number of
terminal rules ($N \to t$), binary-product rules ($N \to N'.N''$), or possibly
an epsilon rule for the axiom ($S \to \varepsilon$);

• Steps i) and ii) remove terminal symbols: After their execution, $\mathcal{P}_N$
contains an arbitrary number of unary ($N \to N'$) or binary ($N \to N'.N''$)
rules;

• Step iii) removes non-unary multiple rules: $\mathcal{P}_N = \{(N \to N'.N'')\}$, or
$\mathcal{P}_N = \{N \to N' \mid N' \in \mathcal{N}' \subset \mathcal{N}\}$;

• Step iv) binarizes multiple rules: $\mathcal{P}_N = \{(N \to N')\}$,
$\mathcal{P}_N = \{(N \to N'.N'')\}$, or $\mathcal{P}_N = \{(N \to N'), (N \to N'')\}$,
where $N', N'' \in \mathcal{N}$;

• Finally, step v) removes extraneous unary non-terminals:
$\mathcal{P}_N = \{(N \to N'.N'')\}$, or $\mathcal{P}_N = \{(N \to N'), (N \to N'')\}$,
$N', N'' \in \mathcal{N}$.

The derivation rules available for the set of non-terminals created during the
transformation are initially in BCNF. Note that the only modification
performed on productions of new non-terminals substitutes a non-terminal for
another, thereby keeping the rules BCNF-compliant. Finally, the constraint on
the initial CNF guarantees that the only epsilon rule is derived from the axiom,
either in a single production or through a sequence of non-referential
productions created at step iv). Therefore one concludes that the produced grammar
is indeed in BCNF.

The proposed transformation from a CNF to an equivalent BCNF can be 
implemented in linear time, through a careful ordering of the removals performed 
by step v), and the number of rules is at most increased by a constant factor. 






